
Category: Python/Ruby

2011-08-23 21:26:19

8.1. Diving in

I often see questions on comp.lang.python like "How can I list all the [headers|images|links] in my HTML document?", "How do I parse/translate/munge the text of my HTML document but leave the tags alone?", and "How can I add/remove/quote attributes of all my HTML tags at once?" This chapter will answer all of these questions.

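The chapter's examples use Python 2's sgmllib. Purely as a taste of the event-driven style to come, here is a minimal sketch answering the first question (listing links) in modern Python 3 with the standard library's html.parser; the LinkCollector name is my own, not from the book:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="a.html">one</a> <a href="b.html">two</a></p>')
print(collector.links)
```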

Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the doc strings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how any of these class methods ever get called. Don't worry, all will be revealed in due time.


Example 8.1. BaseHTMLProcessor.py


If you have not already done so, you can download this and other examples used in this book.


from sgmllib import SGMLParser
import htmlentitydefs

class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text outside of any tag;
        # store the original text verbatim
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. <?instruction>
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)
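sgmllib exists only in Python 2 (it was removed in Python 3). For readers on Python 3, here is a minimal sketch of the same pass-through idea with the standard library's html.parser; PassThroughProcessor is my own name, and this covers only a subset of BaseHTMLProcessor's handlers:

```python
from html.parser import HTMLParser

class PassThroughProcessor(HTMLParser):
    """Rebuild HTML from parser events, as BaseHTMLProcessor does."""
    def __init__(self):
        # convert_charrefs=False makes handle_entityref fire, like sgmllib
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attr, value) tuples, same shape as sgmllib's
        strattrs = "".join(' %s="%s"' % (key, value) for key, value in attrs)
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def output(self):
        # return the reconstructed HTML as a single string
        return "".join(self.pieces)

p = PassThroughProcessor()
p.feed('<pre class="screen">Hello &amp; goodbye</pre>')
print(p.output())
```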

Example 8.2. dialect.py


import re
from BaseHTMLProcessor import BaseHTMLProcessor

class Dialectizer(BaseHTMLProcessor):
    subs = ()

    def reset(self):
        # extend (called from __init__ in ancestor)
        # Reset all data attributes
        self.verbatim = 0
        BaseHTMLProcessor.reset(self)

    def start_pre(self, attrs):
        # called for every <pre> tag in HTML source
        # Increment verbatim mode count, then handle tag like normal
        self.verbatim += 1
        self.unknown_starttag("pre", attrs)

    def end_pre(self):
        # called for every </pre> tag in HTML source
        # Decrement verbatim mode count
        self.unknown_endtag("pre")
        self.verbatim -= 1

    def handle_data(self, text):
        # override
        # called for every block of text in HTML source
        # If in verbatim mode, save text unaltered;
        # otherwise process the text with a series of substitutions
        self.pieces.append(self.verbatim and text or self.process(text))

    def process(self, text):
        # called from handle_data
        # Process text block by performing series of regular expression
        # substitutions (actual substitutions are defined in descendant)
        for fromPattern, toPattern in self.subs:
            text = re.sub(fromPattern, toPattern, text)
        return text

class ChefDialectizer(Dialectizer):
    """convert HTML to Swedish Chef-speak

    based on the classic chef.x, copyright (c) 1992, 1993 John Hagerman
    """
    subs = ((r'a([nu])', r'u\1'),
            (r'A([nu])', r'U\1'),
            (r'a\B', r'e'),
            (r'A\B', r'E'),
            (r'en\b', r'ee'),
            (r'\Bew', r'oo'),
            (r'\Be\b', r'e-a'),
            (r'\be', r'i'),
            (r'\bE', r'I'),
            (r'\Bf', r'ff'),
            (r'\Bir', r'ur'),
            (r'(\w*?)i(\w*?)$', r'\1ee\2'),
            (r'\bow', r'oo'),
            (r'\bo', r'oo'),
            (r'\bO', r'Oo'),
            (r'the', r'zee'),
            (r'The', r'Zee'),
            (r'th\b', r't'),
            (r'\Btion', r'shun'),
            (r'\Bu', r'oo'),
            (r'\BU', r'Oo'),
            (r'v', r'f'),
            (r'V', r'F'),
            (r'w', r'w'),
            (r'W', r'W'),
            (r'([a-z])[.]', r'\1. Bork Bork Bork!'))

class FuddDialectizer(Dialectizer):
    """convert HTML to Elmer Fudd-speak"""
    subs = ((r'[rl]', r'w'),
            (r'qu', r'qw'),
            (r'th\b', r'f'),
            (r'th', r'd'),
            (r'n[.]', r'n, uh-hah-hah-hah.'))

class OldeDialectizer(Dialectizer):
    """convert HTML to mock Middle English"""
    subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'),
            (r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'),
            (r'ick\b', r'yk'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'),
            (r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'),
            (r'([aeiou])re\b', r'\1r'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'),
            (r'tion\b', r'cioun'),
            (r'ion\b', r'ioun'),
            (r'aid', r'ayde'),
            (r'ai', r'ey'),
            (r'ay\b', r'y'),
            (r'ay', r'ey'),
            (r'ant', r'aunt'),
            (r'ea', r'ee'),
            (r'oa', r'oo'),
            (r'ue', r'e'),
            (r'oe', r'o'),
            (r'ou', r'ow'),
            (r'ow', r'ou'),
            (r'\bhe', r'hi'),
            (r've\b', r'veth'),
            (r'se\b', r'e'),
            (r"'s\b", r'es'),
            (r'ic\b', r'ick'),
            (r'ics\b', r'icc'),
            (r'ical\b', r'ick'),
            (r'tle\b', r'til'),
            (r'll\b', r'l'),
            (r'ould\b', r'olde'),
            (r'own\b', r'oune'),
            (r'un\b', r'onne'),
            (r'rry\b', r'rye'),
            (r'est\b', r'este'),
            (r'pt\b', r'pte'),
            (r'th\b', r'the'),
            (r'ch\b', r'che'),
            (r'ss\b', r'sse'),
            (r'([wybdp])\b', r'\1e'),
            (r'([rnt])\b', r'\1\1e'),
            (r'from', r'fro'),
            (r'when', r'whan'))

def translate(url, dialectName="chef"):
    """fetch URL and translate using dialect

    dialect in ("chef", "fudd", "olde")"""
    import urllib
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()
    parserName = "%sDialectizer" % dialectName.capitalize()
    parserClass = globals()[parserName]
    parser = parserClass()
    parser.feed(htmlSource)
    parser.close()
    return parser.output()

def test(url):
    """test all dialects against URL"""
    for dialect in ("chef", "fudd", "olde"):
        outfile = "%s.html" % dialect
        fsock = open(outfile, "wb")
        fsock.write(translate(url, dialect))
        fsock.close()
        import webbrowser
        webbrowser.open_new(outfile)

if __name__ == "__main__":
    test("")
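Notice that translate() never names the parser classes directly: it builds the class name as a string and looks the class object up in globals(). A self-contained sketch of that lookup pattern, with Greeter classes invented purely for illustration:

```python
class ChefGreeter:
    def greet(self):
        return "Bork Bork Bork!"

class FuddGreeter:
    def greet(self):
        return "Uh-hah-hah-hah."

def make_greeter(name):
    # build the class name from a string, look the class object up in the
    # module's global namespace, then instantiate it -- the same pattern
    # translate() uses with "%sDialectizer"
    cls = globals()["%sGreeter" % name.capitalize()]
    return cls()

print(make_greeter("chef").greet())
```

Because classes are first-class objects in Python, no if/elif chain is needed; adding a new dialect means only defining a new class with the right name.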

Example 8.3. Output of dialect.py


Running this script will translate Section 3.2, “Introducing Lists” into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you'll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the code listings and screen examples were left untouched.


<div class="abstract">
<p>Lists awe <span class="application">Pydon</span>
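The "translation" in that excerpt ("Pydon" for "Python") comes entirely from Dialectizer.process's substitution loop. You can reproduce it on a plain string with just re, using the FuddDialectizer patterns from Example 8.2:

```python
import re

# the (pattern, replacement) pairs from FuddDialectizer in Example 8.2
FUDD_SUBS = ((r'[rl]', r'w'),
             (r'qu', r'qw'),
             (r'th\b', r'f'),
             (r'th', r'd'),
             (r'n[.]', r'n, uh-hah-hah-hah.'))

def process(text, subs):
    # apply each substitution in order, exactly as Dialectizer.process does
    for from_pattern, to_pattern in subs:
        text = re.sub(from_pattern, to_pattern, text)
    return text

print(process("Python", FUDD_SUBS))          # -> Pydon
print(process("What a maroon.", FUDD_SUBS))  # -> What a mawoon, uh-hah-hah-hah.
```

Order matters: r'th\b' must run before r'th', or a trailing "th" would become "d" instead of "f".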

                                
