8.1. Diving in
I often see questions on comp.lang.python like “How can I list all
the [headers|images|links] in my HTML document?” “How do I
parse/translate/munge the text of my HTML document but leave the tags
alone?” “How can I add/remove/quote attributes of all my HTML tags at
once?” This chapter will answer all of these questions.
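As a taste of the first question, here is a minimal sketch of listing every link in a document. It is not part of the book's code. It assumes Python 3, where sgmllib is no longer available, so it uses the standard library's html.parser.HTMLParser instead. The names LinkLister and list_links are made up for illustration.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper, not from the book)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, e.g. [("href", "index.html")]
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def list_links(html):
    """Return the href values of all links in an HTML string, in document order."""
    parser = LinkLister()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p>See <a href="index.html">home</a> and <a href="toc.html">contents</a>.</p>'
print(list_links(sample))
```

Listing images or headers would follow the same shape: test for a different tag name and collect a different attribute.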
Here is a complete, working Python program in
two parts. The first part, BaseHTMLProcessor.py, is a generic tool to
help you process HTML files by walking through the tags and text
blocks. The second part, dialect.py, is an example of how
to use BaseHTMLProcessor.py to translate the text of an HTML document
but leave the tags alone. Read the doc strings and comments to get an overview of what's going on. Most of it will seem
like black magic, because it's not obvious how any of these class methods ever
get called. Don't worry, all will be revealed in due time.
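Before reading the listings, it may help to see the overall idea in miniature: walk the document, store every tag and text block as a "piece", and join the pieces at the end. The sketch below is a simplified stand-in, not the book's BaseHTMLProcessor. It assumes Python 3 and html.parser.HTMLParser rather than sgmllib, and the names PassThroughProcessor, Shouter, transform, and shout are hypothetical.

```python
from html.parser import HTMLParser


class PassThroughProcessor(HTMLParser):
    """Rebuild a document piece by piece, letting subclasses alter only the text."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; as separate events
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written in the source
        self.pieces.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/> are also copied through unchanged
        self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_data(self, text):
        # the only place where the document is allowed to change
        self.pieces.append(self.transform(text))

    def transform(self, text):
        # default: leave text alone; subclasses override this
        return text

    def output(self):
        return "".join(self.pieces)


class Shouter(PassThroughProcessor):
    """Toy "translation": upper-case the text, leave the tags alone."""

    def transform(self, text):
        return text.upper()


def shout(html):
    parser = Shouter()
    parser.feed(html)
    parser.close()
    return parser.output()


print(shout('<p class="intro">Lists are <b>fun</b> &amp; easy.</p>'))
```

You never call the handle_* methods yourself. The parser calls them as it meets each tag, entity, or run of text, which is the same "black magic" at work in the listings that follow.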
Example 8.1. BaseHTMLProcessor.py
If you have not already done so, you can download this and other examples used in this book.
from sgmllib import SGMLParser
import htmlentitydefs

class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"
Example 8.2. dialect.py
import re
from BaseHTMLProcessor import BaseHTMLProcessor

class Dialectizer(BaseHTMLProcessor):
    subs = ()

    def reset(self):
        # extend (called from __init__ in ancestor)
        # Reset all data attributes
        self.verbatim = 0
        BaseHTMLProcessor.reset(self)

    def start_pre(self, attrs):
        # called for every <pre> tag in HTML source
        # Increment verbatim mode count, then handle tag like normal
        self.verbatim += 1
        self.unknown_starttag("pre", attrs)

    def end_pre(self):
        # called for every </pre> tag in HTML source
        # Decrement verbatim mode count
        self.unknown_endtag("pre")
        self.verbatim -= 1

    def handle_data(self, text):
        # override
        # called for every block of text in HTML source
        # If in verbatim mode, save text unaltered;
        # otherwise process the text with a series of substitutions
        self.pieces.append(self.verbatim and text or self.process(text))

    def process(self, text):
        # called from handle_data
        # Process text block by performing series of regular expression
        # substitutions (actual substitutions are defined in descendant)
        for fromPattern, toPattern in self.subs:
            text = re.sub(fromPattern, toPattern, text)
        return text

class ChefDialectizer(Dialectizer):
    """convert HTML to Swedish Chef-speak

    based on the classic chef.x, copyright (c) 1992, 1993 John Hagerman
    """
    subs = ((r'a([nu])', r'u\1'),
            (r'A([nu])', r'U\1'),
            (r'a\B', r'e'),
            (r'A\B', r'E'),
            (r'en\b', r'ee'),
            (r'\Bew', r'oo'),
            (r'\Be\b', r'e-a'),
            (r'\be', r'i'),
            (r'\bE', r'I'),
            (r'\Bf', r'ff'),
            (r'\Bir', r'ur'),
            (r'(\w*?)i(\w*?)$', r'\1ee\2'),
            (r'\bow', r'oo'),
            (r'\bo', r'oo'),
            (r'\bO', r'Oo'),
            (r'the', r'zee'),
            (r'The', r'Zee'),
            (r'th\b', r't'),
            (r'\Btion', r'shun'),
            (r'\Bu', r'oo'),
            (r'\BU', r'Oo'),
            (r'v', r'f'),
            (r'V', r'F'),
            (r'w', r'w'),
            (r'W', r'W'),
            (r'([a-z])[.]', r'\1. Bork Bork Bork!'))

class FuddDialectizer(Dialectizer):
    """convert HTML to Elmer Fudd-speak"""
    subs = ((r'[rl]', r'w'),
            (r'qu', r'qw'),
            (r'th\b', r'f'),
            (r'th', r'd'),
            (r'n[.]', r'n, uh-hah-hah-hah.'))

class OldeDialectizer(Dialectizer):
    """convert HTML to mock Middle English"""
    subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'),
            (r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'),
            (r'ick\b', r'yk'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'),
            (r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'),
            (r'([aeiou])re\b', r'\1r'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'),
            (r'tion\b', r'cioun'),
            (r'ion\b', r'ioun'),
            (r'aid', r'ayde'),
            (r'ai', r'ey'),
            (r'ay\b', r'y'),
            (r'ay', r'ey'),
            (r'ant', r'aunt'),
            (r'ea', r'ee'),
            (r'oa', r'oo'),
            (r'ue', r'e'),
            (r'oe', r'o'),
            (r'ou', r'ow'),
            (r'ow', r'ou'),
            (r'\bhe', r'hi'),
            (r've\b', r'veth'),
            (r'se\b', r'e'),
            (r"'s\b", r'es'),
            (r'ic\b', r'ick'),
            (r'ics\b', r'icc'),
            (r'ical\b', r'ick'),
            (r'tle\b', r'til'),
            (r'll\b', r'l'),
            (r'ould\b', r'olde'),
            (r'own\b', r'oune'),
            (r'un\b', r'onne'),
            (r'rry\b', r'rye'),
            (r'est\b', r'este'),
            (r'pt\b', r'pte'),
            (r'th\b', r'the'),
            (r'ch\b', r'che'),
            (r'ss\b', r'sse'),
            (r'([wybdp])\b', r'\1e'),
            (r'([rnt])\b', r'\1\1e'),
            (r'from', r'fro'),
            (r'when', r'whan'))

def translate(url, dialectName="chef"):
    """fetch URL and translate using dialect

    dialect in ("chef", "fudd", "olde")"""
    import urllib
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()
    parserName = "%sDialectizer" % dialectName.capitalize()
    parserClass = globals()[parserName]
    parser = parserClass()
    parser.feed(htmlSource)
    parser.close()
    return parser.output()

def test(url):
    """test all dialects against URL"""
    for dialect in ("chef", "fudd", "olde"):
        outfile = "%s.html" % dialect
        fsock = open(outfile, "wb")
        fsock.write(translate(url, dialect))
        fsock.close()
        import webbrowser
        webbrowser.open_new(outfile)

if __name__ == "__main__":
    test("")
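The heart of each dialect is the loop in Dialectizer.process, which applies a list of regular expression substitutions one after another. The following sketch pulls that loop out so it can be tried without any HTML parsing. The rules are copied from FuddDialectizer.subs above. The names FUDD_SUBS and apply_subs are hypothetical, and the sketch is written to run under Python 3.

```python
import re

# Elmer Fudd rules, copied from FuddDialectizer.subs in the listing above
FUDD_SUBS = ((r'[rl]', r'w'),
             (r'qu', r'qw'),
             (r'th\b', r'f'),
             (r'th', r'd'),
             (r'n[.]', r'n, uh-hah-hah-hah.'))


def apply_subs(text, subs):
    """Apply each (pattern, replacement) pair in order, as Dialectizer.process does."""
    for from_pattern, to_pattern in subs:
        text = re.sub(from_pattern, to_pattern, text)
    return text


print(apply_subs("the rabbit ran.", FUDD_SUBS))
print(apply_subs("with", FUDD_SUBS))
```

Order matters here. Each rule sees the output of the rules before it, so the word-final rule `th\b` has to come before the general `th` rule, or the general rule would consume every "th" first.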
Example 8.3. Output of dialect.py
Running this script will translate Section 3.2, “Introducing Lists” into mock
Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny
cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury
Tales). If you look at the HTML source of the output pages, you'll see that all
the HTML tags and attributes are untouched, but the text between the tags has
been “translated” into the mock language. If you look closer, you'll see that,
in fact, only the titles and paragraphs were translated; the code listings and
screen examples were left untouched.
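The code listings survive because of the verbatim counter in Dialectizer: it goes up on every `<pre>` tag, comes back down on every `</pre>`, and text is only translated while the counter is zero. Here is a small sketch of that mechanism. It is an approximation for Python 3 built on html.parser.HTMLParser, not the book's class; VerbatimAwareTranslator and fuddify are hypothetical names, and only the first Fudd rule is applied to keep the example short.

```python
import re
from html.parser import HTMLParser


class VerbatimAwareTranslator(HTMLParser):
    """Translate text outside <pre> elements; copy everything else unchanged."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.pieces = []
        self.verbatim = 0  # greater than zero while inside one or more <pre> elements

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.verbatim += 1
        self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)
        if tag == "pre":
            self.verbatim -= 1

    def handle_data(self, text):
        if self.verbatim:
            # inside <pre>: save the text unaltered
            self.pieces.append(text)
        else:
            # outside <pre>: apply the first Fudd rule only (r and l become w)
            self.pieces.append(re.sub(r"[rl]", "w", text))

    def output(self):
        return "".join(self.pieces)


def fuddify(html):
    parser = VerbatimAwareTranslator()
    parser.feed(html)
    parser.close()
    return parser.output()


print(fuddify("<p>Lists are</p><pre>li = [1, 2]</pre>"))
```

A counter is used rather than a true/false flag so that nested `<pre>` elements are handled: translation resumes only after the outermost one closes.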
<div class="abstract">
<p>Lists awe <span class="application">Pydon</span>