A bored person -- nothing but tech, and more tech, you know
Category: Python/Ruby
2011-08-30 19:49:29
8.8. Introducing dialect.py
Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered.
To handle the <pre> blocks, you define two methods in Dialectizer: start_pre and end_pre.
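start_pre is called every time SGMLParser finds a <pre> tag in the HTML source. (In a minute, you'll see exactly how this happens.) The method takes a single parameter, attrs, which contains the attributes of the tag (if any). attrs is a list of key/value tuples, just like the one unknown_starttag takes.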
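In the reset method, you initialize a data attribute that serves as a counter for <pre> tags. Every time you hit a <pre> tag, you increment the counter; every time you hit a </pre> tag, you decrement it. (You could treat it as a simple flag and switch it between 0 and 1, but a counter is just as easy, and it also handles the odd but possible case of nested <pre> tags.) In a minute, you'll see how this counter is put to good use.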
That's it, that's the only special processing you do for <pre> tags. Now you pass the list of attributes along to unknown_starttag so it can do the default processing.
end_pre is called every time SGMLParser finds a </pre> tag. Since end tags cannot contain attributes, the method takes no parameters.
First, you want to do the default processing, just like any other end tag.
Second, you decrement your counter to signal that this <pre> block has been closed.
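The listing that belongs with these notes did not survive in this post, so here is a minimal sketch of the idea rather than the original code. MiniProcessor is a hypothetical stand-in for BaseHTMLProcessor (the real class builds on sgmllib, which only exists in Python 2), and the counter name verbatim is an assumption made for illustration.

```python
class MiniProcessor:
    """Hypothetical stand-in for BaseHTMLProcessor, not the real class."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Output buffer: a list of string fragments, joined on output.
        self.pieces = []

    def unknown_starttag(self, tag, attrs):
        # Default processing: rebuild the start tag from its attributes.
        strattrs = "".join(' %s="%s"' % (key, value) for key, value in attrs)
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def unknown_endtag(self, tag):
        # Default processing: rebuild the end tag.
        self.pieces.append("</%s>" % tag)

    def output(self):
        return "".join(self.pieces)


class PreAwareProcessor(MiniProcessor):
    def reset(self):
        # Counter of currently open <pre> blocks (assumed name).
        self.verbatim = 0
        MiniProcessor.reset(self)

    def start_pre(self, attrs):
        # Called for each <pre>: count it, then do the default processing.
        self.verbatim += 1
        self.unknown_starttag("pre", attrs)

    def end_pre(self):
        # Called for each </pre>: default processing first, then uncount.
        self.unknown_endtag("pre")
        self.verbatim -= 1
```

Because verbatim is a counter and not a flag, a nested <pre> inside another <pre> still leaves it above zero until the outermost block closes.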
At this point, it's worth digging a little further into SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it's not magic, it's just good Python coding.
Example 8.18. SGMLParser
At this point, SGMLParser has already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a specific handler method for this tag, or whether you should fall back on the default method (unknown_starttag).
The “magic” of SGMLParser is nothing more than your old friend, getattr. What you may not have realized before is that getattr will find methods defined in descendants of an object as well as the object itself. Here the object is self, the current instance. So if tag is 'pre', this call to getattr will look for a start_pre method on the current instance, which is an instance of the Dialectizer class.
getattr raises an AttributeError if the method it's looking for doesn't exist in the object (or any of its descendants), but that's okay, because you wrapped the call to getattr inside a try...except block and explicitly caught the AttributeError.
Since you didn't find a start_xxx method, you'll also look for a do_xxx method before giving up. This alternate naming scheme is generally used for standalone tags, like <br>, which have no corresponding end tag.
Another AttributeError, which means that the call to getattr failed with do_xxx. Since you found neither a start_xxx nor a do_xxx method for this tag, you catch the exception and fall back on the default method, unknown_starttag.
Remember, try...except blocks can have an else clause, which is called if no exception is raised during the try...except block. Logically, that means that you did find a do_xxx method for this tag, so you're going to call it.
By the way, don't worry about these different return values; in theory they mean something, but they're never actually used. Don't worry about the self.stack.append(tag) either; SGMLParser keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this information either. In theory, you could use this module to validate that your tags were fully balanced, but it's probably not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now.
start_xxx and do_xxx methods are not called directly; the tag, method, and attributes are passed to this function, handle_starttag, so that descendants can override it and change the way all start tags are dispatched. You don't need that level of control, so you just let this method do its thing, which is to call the method (start_xxx or do_xxx) with the list of attributes. Remember, method is a function, returned from getattr, and functions are objects. (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run out of ways to use it to my advantage.) Here, the function object is passed into this dispatch method as an argument, and this method turns around and calls the function. At this point, you don't need to know what the function is, what it's named, or where it's defined; the only thing you need to know about the function is that it is called with one argument, attrs.
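The listing for this example is missing from this post as well. The lookup order described in the notes above can be sketched roughly as follows. This is a simplified re-creation for illustration, not sgmllib's actual source: TagDispatcher, Demo, and the log attribute are hypothetical names, and the return values 1, 0, and -1 are assumptions standing in for the "different return values" mentioned above.

```python
class TagDispatcher:
    """Hypothetical sketch of the handler lookup; not sgmllib's source."""

    def __init__(self):
        self.stack = []  # start tags that expect a matching end tag
        self.log = []    # records which handler ran (illustration only)

    def finish_starttag(self, tag, attrs):
        try:
            # Look for a tag-specific handler such as start_pre.
            method = getattr(self, "start_" + tag)
        except AttributeError:
            try:
                # No start_xxx: try the standalone-tag scheme, do_xxx.
                method = getattr(self, "do_" + tag)
            except AttributeError:
                # Neither exists: fall back on the default method.
                self.unknown_starttag(tag, attrs)
                return -1
            else:
                # A do_xxx method was found, so dispatch to it.
                self.handle_starttag(tag, method, attrs)
                return 0
        else:
            # A start_xxx method was found: remember the open tag.
            self.stack.append(tag)
            self.handle_starttag(tag, method, attrs)
            return 1

    def handle_starttag(self, tag, method, attrs):
        # method is a bound method object; call it with the attributes.
        method(attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append(("unknown", tag))


class Demo(TagDispatcher):
    def start_pre(self, attrs):
        self.log.append(("start_pre", attrs))

    def do_br(self, attrs):
        self.log.append(("do_br", attrs))
```

Note that the base class never mentions pre or br; getattr finds start_pre and do_br on the Demo instance at run time, which is the whole trick.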
Now back to our regularly scheduled program: Dialectizer. When you left, you were in the process of defining specific handler methods for <pre> and </pre> tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, you need to override the handle_data method.
Example 8.19. Overriding the handle_data method
handle_data is called with only one argument, the text to process.
In the ancestor BaseHTMLProcessor, the handle_data method simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of a <pre>...</pre> block, the text goes into the output buffer unaltered; otherwise, it is run through the substitutions first and the result goes into the output buffer.
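Since the listing for this example is also missing here, the following is only a sketch of that branching logic. DataProcessor, the verbatim counter, the process method, and the single substitution rule are all hypothetical names and data chosen for illustration; the real dialect.py defines its own substitutions.

```python
import re


class DataProcessor:
    """Hypothetical stand-in showing the handle_data override only."""

    # (pattern, replacement) pairs, made up for this illustration.
    subs = ((r"\bhello\b", "ahoy"),)

    def __init__(self):
        self.pieces = []   # output buffer
        self.verbatim = 0  # assumed name: count of open <pre> blocks

    def process(self, text):
        # Apply each regular-expression substitution in order.
        for pattern, replacement in self.subs:
            text = re.sub(pattern, replacement, text)
        return text

    def handle_data(self, text):
        if self.verbatim:
            # Inside a <pre> block: pass the text through unaltered.
            self.pieces.append(text)
        else:
            # Outside: run the substitutions, then buffer the result.
            self.pieces.append(self.process(text))
```

For example, feeding "hello world" with verbatim at 0 buffers "ahoy world", while the same call with verbatim above 0 buffers the text unchanged.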
You're close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes later in dialect.py define a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough for one chapter.
Translator's note: slog (verb) means to hit hard, or to trudge along with difficulty.