Category: Python/Ruby

2011-08-26 17:36:58

8.4. Introducing BaseHTMLProcessor.py

SGMLParser doesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it finds, but the methods don't do anything. SGMLParser is an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll take this one step further by defining a class that catches everything SGMLParser throws at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer.

BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data.

Example 8.8. Introducing BaseHTMLProcessor

from sgmllib import SGMLParser   # Python 2 only; sgmllib was removed in Python 3
import htmlentitydefs

class BaseHTMLProcessor(SGMLParser):
    def reset(self):                            # (1)
        # pieces of reconstructed HTML, joined later by output()
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):     # (2)
        # rebuild the start tag; attribute values are always double-quoted
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):              # (3)
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):              # (4)
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):            # (5)
        self.pieces.append("&%(ref)s" % locals())
        # only standard HTML entities end in a semicolon
        if ref in htmlentitydefs.entitydefs:
            self.pieces.append(";")

    def handle_data(self, text):                # (6)
        self.pieces.append(text)

    def handle_comment(self, text):             # (7)
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):                  # (8)
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        self.pieces.append("<!%(text)s>" % locals())

1

reset, called by SGMLParser.__init__, initializes self.pieces as an empty list before calling the ancestor method. self.pieces is a data attribute (an instance variable) which will hold the pieces of the HTML document you're constructing. Each handler method will reconstruct the HTML that SGMLParser parsed, and each method will append that string to self.pieces. Note that self.pieces is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but Python strings are immutable, so every append would build a brand-new string; collecting the pieces in a list and joining them once at the end is generally more efficient.
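The list-then-join idea described above can be sketched in isolation. This is a minimal illustration rather than code from the book; the function name build_document is hypothetical.

```python
def build_document(fragments):
    """Collect fragments in a list and join them once at the end."""
    pieces = []
    for fragment in fragments:
        # Appending to a list does not copy the text gathered so far.
        pieces.append(fragment)
    # A single join builds the final string in one pass.
    return "".join(pieces)


print(build_document(["<p>", "hello", "</p>"]))
```

Repeated string concatenation would give the same result; the list version simply avoids creating an intermediate string for every appended piece.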

 

 

2

Since BaseHTMLProcessor does not define any methods for specific tags (like the start_a method in URLLister), SGMLParser will call unknown_starttag for every start tag. This method takes the tag (tag) and the list of attribute name/value pairs (attrs), reconstructs the original HTML, and appends it to self.pieces. The string formatting here is a little strange; you'll untangle that (and also the odd-looking locals function) later in this chapter.
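To see what the start-tag reconstruction produces, the same two formatting expressions can be pulled out into a standalone function. The name rebuild_starttag is an assumption made for this sketch; the formatting itself mirrors unknown_starttag.

```python
def rebuild_starttag(tag, attrs):
    """Rebuild a start tag from a tag name and (name, value) pairs."""
    # Each attribute becomes ' name="value"', always with double quotes.
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    # locals() is a dictionary of local variables, so %(tag)s and
    # %(strattrs)s are looked up by name.
    return "<%(tag)s%(strattrs)s>" % locals()


print(rebuild_starttag("a", [("href", "index.html"), ("title", "Home")]))
```

Note that the original quoting style is not preserved: whatever the source document used, the rebuilt tag has double-quoted values.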

 

 

3

Reconstructing end tags is much simpler; just take the tag name and wrap it in </...> brackets.

 

 

4

When SGMLParser finds a character reference, it calls handle_charref with the bare reference. If the HTML document contains the reference &#160;, ref will be 160. Reconstructing the original complete character reference just involves wrapping ref in &#...; characters.

 

 

5

Entity references are similar to character references, but without the hash mark. Reconstructing the original entity reference requires wrapping ref in &...; characters. (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this. Only certain standard HTML entities end in a semicolon; other similar-looking entities do not. Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called htmlentitydefs. Hence the extra if statement.)
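The two kinds of references can be tried out on their own. The sketch below assumes hypothetical helper names (rebuild_charref, rebuild_entityref) and imports the entity dictionary from html.entities, which is the Python 3 name of the htmlentitydefs module, falling back to the Python 2 name.

```python
try:
    from html.entities import entitydefs    # Python 3 module name
except ImportError:
    from htmlentitydefs import entitydefs   # Python 2 module name


def rebuild_charref(ref):
    """Wrap a bare character reference such as '160' in &#...;"""
    return "&#%s;" % ref


def rebuild_entityref(ref):
    """Wrap an entity name in &...; adding ';' only for standard entities."""
    piece = "&%s" % ref
    if ref in entitydefs:
        piece += ";"
    return piece


print(rebuild_charref("160"))
print(rebuild_entityref("amp"))
print(rebuild_entityref("notarealentity"))
```

A name that is not in the dictionary is emitted without the trailing semicolon, which is the case the extra if statement exists for.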

 

 

6

Blocks of text are simply appended to self.pieces unaltered.

 

 

7

HTML comments are wrapped in <!--...--> characters.

 

 

8

Processing instructions are wrapped in <?...> characters.

 

 

 

Important

 

 

The HTML specification requires that all non-HTML (like client-side JavaScript) be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't). BaseHTMLProcessor is not forgiving; if script is improperly embedded, it will be parsed as if it were HTML. For instance, if the script contains less-than and equals signs, SGMLParser may incorrectly think that it has found tags and attributes. SGMLParser always converts tags and attribute names to lowercase, which may break the script, and BaseHTMLProcessor always encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script within HTML comments.

 

 

Example 8.9. BaseHTMLProcessor output

    def output(self):                       # (1)
        """Return processed HTML as a single string"""
        return "".join(self.pieces)         # (2)

1

This is the one method in BaseHTMLProcessor that is never called by the ancestor SGMLParser. Since the other handler methods store their reconstructed HTML in self.pieces, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it.

2

If you prefer, you could use the join function of the string module instead: string.join(self.pieces, ""). The second argument is the separator; if you leave it out, string.join puts a single space between the pieces.
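The whole producer can be tried end to end in Python 3, where sgmllib no longer exists. The sketch below is an approximation, not the book's code: it swaps in html.parser.HTMLParser as the base class, uses handle_starttag and handle_endtag in place of unknown_starttag and unknown_endtag, and the names RoundTripProcessor and round_trip are hypothetical.

```python
from html.parser import HTMLParser
from html.entities import entitydefs


class RoundTripProcessor(HTMLParser):
    """Hypothetical Python 3 stand-in for BaseHTMLProcessor."""

    def __init__(self):
        # convert_charrefs=False makes the parser report &#160; and &amp;
        # through handle_charref / handle_entityref instead of as text.
        super().__init__(convert_charrefs=False)

    def reset(self):
        # Called by HTMLParser.__init__, like reset in the original class.
        self.pieces = []
        super().reset()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive with value None; emit the bare name.
        strattrs = "".join(
            ' %s="%s"' % (key, value) if value is not None else " %s" % key
            for key, value in attrs
        )
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_entityref(self, name):
        self.pieces.append("&%s" % name)
        if name in entitydefs:
            self.pieces.append(";")

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_pi(self, data):
        self.pieces.append("<?%s>" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)


def round_trip(html_text):
    """Feed a document through the processor and return the rebuilt HTML."""
    parser = RoundTripProcessor()
    parser.feed(html_text)
    parser.close()
    return parser.output()


source = "<html><body><!-- hi --><P CLASS='x'>a &amp; b&#160;c</P></body></html>"
print(round_trip(source))
```

The rebuilt document matches the input except for the normalization the Important note warns about: tag and attribute names come back in lowercase and the attribute value is double-quoted.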

