A Bored Person: nothing but tech, and more tech, you know
Category: Python/Ruby
2011-08-24 19:29:03
8.2. Introducing sgmllib.py
HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.
The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally.
sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method.
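To make the "structure determines the sequence of method calls" idea concrete, here is a minimal sketch. It is an assumption-laden stand-in rather than sgmllib itself: sgmllib was removed in Python 3, so the sketch uses html.parser.HTMLParser, which follows the same subclass-and-override design but routes every start tag through a single handle_starttag method. The class name EventRecorder and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical subclass that logs each callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed('<p class="intro">Hello <b>world</b></p>')
recorder.close()

# The order of the recorded events mirrors the nesting of the markup.
for event in recorder.events:
    print(event)
```

The parser never builds a tree here; the subclass only sees a stream of calls, and it is up to the overriding methods to decide what to keep.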
SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them:
Start tag
An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser will look for a method called start_tagname or do_tagname. For instance, when it finds a <pre> tag, it will look for a start_pre or do_pre method. If found, SGMLParser calls this method with a list of the tag's attributes; otherwise, it calls unknown_starttag with the tag name and list of attributes.
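The lookup order described above (start_tagname, then do_tagname, then unknown_starttag) can be sketched with getattr. This is an illustration of the dispatch idea only, not sgmllib's actual source: because sgmllib is not available in Python 3, the sketch layers the lookup on top of html.parser.HTMLParser, and the class name DispatchingParser and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Sketch of sgmllib-style per-tag dispatch (an assumption, not sgmllib code)."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        # Try start_<tag> first, then do_<tag>; fall back to unknown_starttag.
        method = getattr(self, "start_" + tag, None) or getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def start_pre(self, attrs):
        # Handler for a block tag
        self.calls.append(("start_pre", attrs))

    def do_br(self, attrs):
        # Handler for a standalone tag
        self.calls.append(("do_br", attrs))

    def unknown_starttag(self, tag, attrs):
        # Catch-all for tags without a dedicated handler
        self.calls.append(("unknown_starttag", tag, attrs))


parser = DispatchingParser()
parser.feed('<pre class="code">x</pre><br><blink>')
parser.close()
print(parser.calls)
```

Only the handlers receive the attribute list; the catch-all also receives the tag name, since it has no other way to know which tag triggered it.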
End tag
An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser will look for a method called end_tagname. If found, SGMLParser calls this method; otherwise, it calls unknown_endtag with the tag name.
Character reference
An escaped character referenced by its decimal or hexadecimal equivalent, like &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal character equivalent.
Entity reference
An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.
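A short sketch of what the two reference handlers receive. As an assumption, it uses html.parser.HTMLParser (the Python 3 standard library parser with the same handle_charref and handle_entityref method names) in place of sgmllib, and it passes convert_charrefs=False so that references are reported as separate events instead of being decoded into text. The class name ReferenceCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ReferenceCollector(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        # convert_charrefs=False keeps references as distinct callbacks
        super().__init__(convert_charrefs=False)
        self.charrefs = []
        self.entityrefs = []

    def handle_charref(self, name):
        # Receives the digits only, e.g. "160", or "xA0" for the hex form
        self.charrefs.append(name)

    def handle_entityref(self, name):
        # Receives the entity name without "&" and ";", e.g. "copy"
        self.entityrefs.append(name)


collector = ReferenceCollector()
collector.feed("<p>&#160;&#xA0;&copy; 2011</p>")
collector.close()

print(collector.charrefs)
print(collector.entityrefs)
```

In both cases the handler gets the raw reference text, not the decoded character, so turning it into an actual character is left to the subclass.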
Comment
An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.
Comment:
An HTML comment, enclosed in