A Bored Soul -- nothing but technology, and more technology; you get it
Category: Python/Ruby
2011-08-26 17:36:58
8.4. Introducing BaseHTMLProcessor.py

SGMLParser doesn't produce anything by itself. It parses and
parses and parses, and it calls a method for each interesting thing it finds,
but the methods don't do anything. SGMLParser is
an HTML consumer: it takes HTML and breaks it down
into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things,
like a list of all the links on a web page. Now you'll take this one step
further by defining a class that catches everything SGMLParser throws at it and reconstructs the
complete HTML document. In technical terms, this class will be
an HTML producer.

BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data.

Example 8.8. Introducing BaseHTMLProcessor

reset, called
by SGMLParser.__init__, initializes self.pieces as an empty
list before calling the ancestor method. self.pieces is a data attribute which will hold the pieces of
the HTML document you're constructing. Each handler method will
reconstruct the HTML that SGMLParser parsed, and each
method will append that string to self.pieces. Note that self.pieces is
a list. You might be tempted to define it as a string and just keep appending
each piece to it. That would work, but Python is much more
efficient at dealing with lists.[2]

Since BaseHTMLProcessor does
not define any methods for specific tags (like the start_a method
in URLLister), SGMLParser will
call unknown_starttag for every start tag. This method takes the
tag (tag) and the list of attribute name/value pairs (attrs), reconstructs
the original HTML, and appends it to self.pieces. The string
formatting here is a little strange; you'll untangle that (and also the
odd-looking locals function) later in this chapter.

Reconstructing end tags is much
simpler; just take the tag name and wrap it in
the </...> brackets.

When SGMLParser finds a
character reference, it calls handle_charref with the bare
reference. If the HTML document contains the
reference &#160;, ref will be 160. Reconstructing the
original complete character reference just involves
wrapping ref in &#...; characters.

Entity references are similar to
character references, but without the hash mark. Reconstructing the original
entity reference requires
wrapping ref in &...; characters. (Actually, as an
erudite reader pointed out to me, it's slightly more complicated than this.
Only certain standard HTML entities end in a semicolon; other
similar-looking entities do not. Luckily for us, the set of
standard HTML entities is defined in a dictionary in a Python module
called htmlentitydefs. Hence the extra if statement.)

Blocks of text are simply appended
to self.pieces unaltered.

HTML comments are wrapped
in <!--...--> characters.

Processing instructions are wrapped in <?...> characters.
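The handler methods described above can be sketched roughly as follows. This is not the original listing: sgmllib was removed in Python 3, so the sketch assumes html.parser.HTMLParser as a stand-in, which means handle_starttag and handle_endtag take the place of unknown_starttag and unknown_endtag, and html.entities takes the place of htmlentitydefs. The class name BaseHTMLProcessorSketch is hypothetical, and plain % formatting is used instead of the locals() trick mentioned above.

```python
from html.entities import entitydefs
from html.parser import HTMLParser


class BaseHTMLProcessorSketch(HTMLParser):
    """Hypothetical Python 3 stand-in for the class described in the text."""

    def __init__(self):
        # Assumption: convert_charrefs=False, so that character and entity
        # references reach their own handlers instead of being decoded.
        super().__init__(convert_charrefs=False)

    def reset(self):
        # Called by HTMLParser.__init__; start with an empty list of pieces
        # before calling the ancestor method.
        self.pieces = []
        super().reset()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. Values are always re-quoted
        # with double quotes. A valueless attribute arrives with value None,
        # which this sketch does not special-case.
        strattrs = "".join(' %s="%s"' % (key, value) for key, value in attrs)
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_charref(self, ref):
        # ref is the bare reference, e.g. "160" for &#160;
        self.pieces.append("&#%s;" % ref)

    def handle_entityref(self, ref):
        # Only standard HTML entities get the trailing semicolon back.
        self.pieces.append("&%s" % ref)
        if ref in entitydefs:
            self.pieces.append(";")

    def handle_data(self, text):
        # Blocks of text are appended unaltered.
        self.pieces.append(text)

    def handle_comment(self, text):
        self.pieces.append("<!--%s-->" % text)

    def handle_pi(self, text):
        self.pieces.append("<?%s>" % text)

    def handle_decl(self, text):
        self.pieces.append("<!%s>" % text)

    def output(self):
        # Join the collected pieces into one string on request.
        return "".join(self.pieces)


parser = BaseHTMLProcessorSketch()
parser.feed("<p class='intro'>Caf&eacute;&#160;<!-- note -->ok</p>")
parser.close()
result = parser.output()
print(result)
```

Note that the single-quoted attribute value in the input comes back in double quotes, which is the behavior the text attributes to BaseHTMLProcessor.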
The HTML specification
requires that all non-HTML (like client-side JavaScript) must be
enclosed in HTML comments, but not all web pages do this properly
(and all modern web browsers are forgiving if they don't). BaseHTMLProcessor is
not forgiving; if script is improperly embedded, it will be parsed as if it
were HTML. For instance, if the script contains less-than and equals
signs, SGMLParser may incorrectly think that it has found tags and
attributes. SGMLParser always converts tags and attribute names to lowercase,
which may break the script, and BaseHTMLProcessor always encloses
attribute values in double quotes (even if the
original HTML document used single quotes or no quotes), which will
certainly break the script. Always protect your client-side script
within HTML comments.

Example 8.9. BaseHTMLProcessor output

This is the one method
in BaseHTMLProcessor that is never called by the
ancestor SGMLParser. Since the other handler methods store their
reconstructed HTML in self.pieces, this function is needed to join
all those pieces into one string. As noted before, Python is great
at lists and mediocre at strings, so you only create the complete string when
somebody explicitly asks for it.

If you prefer, you could use
the join method of the string module
instead: string.join(self.pieces, "")
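A minimal sketch of the two spellings of the join step, assuming a hypothetical pieces list in place of self.pieces. The string.join function shown above exists only in Python 2's string module (it was removed in Python 3), so the second form here calls the join method through the str type instead, which is the closest equivalent that still runs today.

```python
# Hypothetical stand-in for the self.pieces list built by the handlers.
pieces = ["<p>", "Hello, ", "world", "</p>"]

# The method form: the separator string joins the list elements.
joined = "".join(pieces)

# Function-style form; the separator comes first, the list second
# (the reverse of the old string.join argument order).
same = str.join("", pieces)

print(joined)
```

Either way, the complete string is built once, only when it is asked for, rather than by appending to a string for every piece.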