Category: Python/Ruby

2011-08-30 19:49:29

8.8. Introducing dialect.py

Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered.

To handle the <pre> blocks, you define two methods in Dialectizer: start_pre and end_pre.

Example 8.17. Handling specific tags

    def start_pre(self, attrs):
        # entering a <pre> block: bump the nesting counter
        self.verbatim += 1
        self.unknown_starttag("pre", attrs)

    def end_pre(self):
        self.unknown_endtag("pre")
        # leaving a <pre> block: drop the nesting counter
        self.verbatim -= 1

     

start_pre is called every time SGMLParser finds a <pre> tag in the HTML source. (In a minute, you'll see exactly how this happens.) The method takes a single parameter, attrs, which contains the attributes of the tag (if any). attrs is a list of key/value tuples, just like unknown_starttag takes.

In the reset method, you initialize a data attribute, self.verbatim, that serves as a counter for <pre> tags. Every time you hit a <pre> tag, you increment the counter; every time you hit a </pre> tag, you decrement it. (You could just use this as a flag, setting it to 1 and resetting it to 0, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested <pre> tags.) In a minute, you'll see how this counter is put to good use.

That's it, that's the only special processing you do for <pre> tags. Now you pass the list of attributes along to unknown_starttag so it can do the default processing.
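If the "list of key/value tuples" feels abstract, here is a minimal sketch of what default processing might do with it: rebuild the start tag as text. The helper name render_starttag is hypothetical and is not part of SGMLParser or dialect.py; the real unknown_starttag may differ in its details.

```python
def render_starttag(tag, attrs):
    """Hypothetical helper: turn a tag name and its attribute list back into HTML.

    attrs is a list of (key, value) tuples, the same shape start_pre receives.
    """
    strattrs = "".join(' %s="%s"' % (key, value) for key, value in attrs)
    return "<%s%s>" % (tag, strattrs)


print(render_starttag("pre", [("class", "code"), ("id", "x")]))
print(render_starttag("pre", []))
```

A tag with no attributes just gets an empty list, so the same code path handles both cases.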

end_pre is called every time SGMLParser finds a </pre> tag. Since end tags can not contain attributes, the method takes no parameters.

First, you want to do the default processing, just like any other end tag.

Second, you decrement your counter to signal that this <pre> block has been closed.
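To see why a counter beats a flag when <pre> tags are nested, here is a self-contained sketch. It makes two assumptions: Python 3's html.parser.HTMLParser stands in for SGMLParser (sgmllib is not available in Python 3), and the class name VerbatimTracker and its "upper-case everything" substitution are made up for illustration. It is not the real Dialectizer.

```python
from html.parser import HTMLParser


class VerbatimTracker(HTMLParser):
    """Toy stand-in for Dialectizer: upper-cases text outside <pre> blocks."""

    def __init__(self):
        super().__init__()
        self.verbatim = 0   # nesting depth of <pre> blocks
        self.pieces = []    # output buffer

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.verbatim += 1
        self.pieces.append("<%s>" % tag)    # attributes ignored for brevity

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)
        if tag == "pre":
            self.verbatim -= 1

    def handle_data(self, data):
        # Depth > 0 means you are somewhere inside <pre>: pass text through.
        self.pieces.append(data if self.verbatim else data.upper())


def convert(html):
    parser = VerbatimTracker()
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces)


print(convert("<p>hi</p><pre>a<pre>b</pre>c</pre><p>bye</p>"))
```

With a simple 0/1 flag, the inner </pre> would reset the flag and the text "c" would be altered even though it is still inside the outer block. With a counter, the depth only returns to 0 at the outer </pre>.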

At this point, it's worth digging a little further into SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it's not magic, it's just good Python coding.

Example 8.18. SGMLParser


  

    def finish_starttag(self, tag, attrs):
        try:
            # first choice: a start_xxx handler
            method = getattr(self, 'start_' + tag)
        except AttributeError:
            try:
                # second choice: a do_xxx handler
                method = getattr(self, 'do_' + tag)
            except AttributeError:
                # no specific handler: fall back on the default
                self.unknown_starttag(tag, attrs)
                return -1
            else:
                self.handle_starttag(tag, method, attrs)
                return 0
        else:
            self.stack.append(tag)
            self.handle_starttag(tag, method, attrs)
            return 1

    def handle_starttag(self, tag, method, attrs):
        method(attrs)

     

At this point, SGMLParser has already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a specific handler method for this tag, or whether you should fall back on the default method (unknown_starttag).

The "magic" of SGMLParser is nothing more than your old friend, getattr. What you may not have realized before is that getattr will find methods defined in descendants of an object as well as the object itself. Here the object is self, the current instance. So if tag is 'pre', this call to getattr will look for a start_pre method on the current instance, which is an instance of the Dialectizer class.
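Here's a stripped-down sketch of that lookup, with no parser involved at all. The class names Base and Child are hypothetical; the point is only that code written in the ancestor can find a method that exists solely in the descendant, because getattr looks at self, the actual instance.

```python
class Base:
    def dispatch(self, tag):
        # Base never defines start_pre, but self may be a descendant that does.
        try:
            method = getattr(self, "start_" + tag)
        except AttributeError:
            return "unknown_starttag(%s)" % tag
        return method()


class Child(Base):
    def start_pre(self):
        return "start_pre called"


print(Child().dispatch("pre"))   # handler found on the descendant
print(Child().dispatch("b"))     # no start_b anywhere: default path
print(Base().dispatch("pre"))    # a plain Base instance has no start_pre
```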

getattr raises an AttributeError if the method it's looking for doesn't exist in the object (or any of its descendants), but that's okay, because you wrapped the call to getattr inside a try...except block and explicitly caught the AttributeError.

Since you didn't find a start_xxx method, you'll also look for a do_xxx method before giving up. This alternate naming scheme is generally used for standalone tags, like <BR>, which have no corresponding end tag. But you can use either naming scheme; as you can see, SGMLParser tries both for every tag. (You shouldn't define both a start_xxx and a do_xxx handler method for the same tag, though; only the start_xxx method will get called.)

Another AttributeError, which means that the call to getattr failed with do_xxx. Since you found neither a start_xxx nor a do_xxx method for this tag, you catch the exception and fall back on the default method, unknown_starttag.

Remember, try...except blocks can have an else clause, which is called if no exception is raised during the try...except block. Logically, that means that you did find a do_xxx method for this tag, so you're going to call it.
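The same lookup order can also be written without nested try...except blocks, using the three-argument form of getattr, which returns a default instead of raising AttributeError. This is an alternative sketch, not how SGMLParser does it; find_handler and the Handlers class are hypothetical names.

```python
def find_handler(obj, tag):
    """Return (name, method), trying start_xxx first, then do_xxx."""
    for prefix in ("start_", "do_"):
        # getattr with a default returns None instead of raising AttributeError
        method = getattr(obj, prefix + tag, None)
        if method is not None:
            return prefix + tag, method
    return "unknown_starttag", None


class Handlers:
    def start_a(self, attrs):
        return "start_a"

    def do_a(self, attrs):
        return "do_a"    # never reached: start_a wins for the same tag

    def do_br(self, attrs):
        return "do_br"


h = Handlers()
for tag in ("a", "br", "p"):
    print(tag, "->", find_handler(h, tag)[0])
```

Note how defining both start_a and do_a gains you nothing: the start_ version is always found first.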

By the way, don't worry about these different return values; in theory they mean something, but they're never actually used. Don't worry about the self.stack.append(tag) either; SGMLParser keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this information either. In theory, you could use this module to validate that your tags were fully balanced, but it's probably not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now.

start_xxx and do_xxx methods are not called directly; the tag, method, and attributes are passed to this function, handle_starttag, so that descendants can override it and change the way all start tags are dispatched. You don't need that level of control, so you just let this method do its thing, which is to call the method (start_xxx or do_xxx) with the list of attributes. Remember, method is a function, returned from getattr, and functions are objects. (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run out of ways to use it to my advantage.) Here, the function object is passed into this dispatch method as an argument, and this method turns around and calls the function. At this point, you don't need to know what the function is, what it's named, or where it's defined; the only thing you need to know about the function is that it is called with one argument, attrs.
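You don't need to override handle_starttag for Dialectizer, but a quick sketch shows what that hook buys you. MiniParser and LoggingParser are hypothetical toy classes that model only the dispatch step; they assume a start_xxx handler always exists, which the real SGMLParser does not.

```python
class MiniParser:
    """Toy stand-in for SGMLParser; only the dispatch step is modeled."""

    def finish_starttag(self, tag, attrs):
        method = getattr(self, "start_" + tag)   # assumes the handler exists
        self.handle_starttag(tag, method, attrs)

    def handle_starttag(self, tag, method, attrs):
        # Default dispatch: call the function object you were handed.
        method(attrs)


class LoggingParser(MiniParser):
    def __init__(self):
        self.seen = []
        self.verbatim = 0

    def handle_starttag(self, tag, method, attrs):
        # One override changes how every start tag is dispatched.
        self.seen.append(tag)
        MiniParser.handle_starttag(self, tag, method, attrs)

    def start_pre(self, attrs):
        self.verbatim += 1

    def start_b(self, attrs):
        pass


p = LoggingParser()
p.finish_starttag("pre", [])
p.finish_starttag("b", [])
print(p.seen, p.verbatim)
```

The override never needs to know which handler it was given; it records the tag and then calls whatever function object arrived in method.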

Now back to our regularly scheduled program: Dialectizer. When you left, you were in the process of defining specific handler methods for <pre> and </pre> tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, you need to override the handle_data method.

Example 8.19. Overriding the handle_data method


   

    def handle_data(self, text):
        # inside <pre>...</pre> (verbatim > 0): keep text; otherwise substitute
        self.pieces.append(self.verbatim and text or self.process(text))

handle_data is called with only one argument, the text to process.

In the ancestor BaseHTMLProcessor, the handle_data method simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of a <pre>...</pre> block, self.verbatim will be some value greater than 0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using the and-or trick.
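In case the and-or trick is rusty, here is a small side-by-side sketch of it and the conditional expression that later versions of Python provide. The function names and_or and conditional are hypothetical, and process is passed in as an argument only to keep the example self-contained.

```python
def and_or(verbatim, text, process):
    # The and-or trick: "verbatim and text" yields text when verbatim is
    # true, but an empty (false) text falls through to the "or" branch.
    return verbatim and text or process(text)


def conditional(verbatim, text, process):
    # Conditional expression: picks a branch on verbatim alone.
    return text if verbatim else process(text)


print(and_or(1, "abc", str.upper))         # inside <pre>: unchanged
print(and_or(0, "abc", str.upper))         # outside <pre>: processed
print(repr(and_or(1, "", lambda t: "X")))  # empty text still gets processed
print(repr(conditional(1, "", lambda t: "X")))
```

The two agree whenever text is non-empty. They differ only when text is an empty string, where the and-or version runs the substitutions anyway; that's harmless as long as processing an empty string gives back an empty string.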

You're close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes later in dialect.py define a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough for one chapter.
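Without slogging through the real ones, here's a rough sketch of what "a series of regular expressions" applied to a block of text can look like. The SUBS table and the standalone process function are invented for illustration; they are not the substitutions dialect.py actually defines.

```python
import re

# Hypothetical substitutions, applied in order: (pattern, replacement).
SUBS = (
    (r"\bthe\b", "zee"),
    (r"w", "v"),
)


def process(text, subs=SUBS):
    """Run text through each regular-expression substitution in turn."""
    for pattern, replacement in subs:
        text = re.sub(pattern, replacement, text)
    return text


print(process("the window"))
```

Order matters here: each substitution sees the output of the one before it, so a later pattern can rewrite what an earlier replacement produced.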

Slog (transitive verb): to hit hard; to trudge along laboriously.

 
