journal of python(47)　－－dive into python 　-kinfinger-ChinaUnix博客

kinfingerasage.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

kinfinger

博客访问： 1823272
博文数量： 335
博客积分： 4690
博客等级：上校
技术积分： 4341
用户组：普通用户
注册时间： 2010-05-08 21:38

个人简介

无聊之人--除了技术，还是技术，你懂得

文章分类

全部博文（335）

ORACEL（4）
REDIS（1）
REDIS（0）
LINUX/UNIX（4）
PHP（1）
COGNOS（1）
COBOL（4）
CICS（3）
EXCEL（2）
MYSQL（7）
DB2（72）
TWS（0）
SA（0）
mainframe（16）
web（10）

javascript（1）
APUE（43）
REXX（7）
work（13）
life（13）
python（95）
c/c++（26）
asage（7）
未分配的博文（6）

文章存档

2016年（29）

2015年（18）

2014年（7）

2013年（86）

2012年（90）

2011年（105）

我的朋友

相关博文

journal of python(47)　－－dive into python 　

分类： Python/Ruby

2011-08-23 21:26:19

8.1. Diving in

I often see questions on like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of myHTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions.

我在com.lang.python上面经常看到这样的问题：我应该如何做才能把html文件中的headers，images，links，都显示出来？我应该如何解析（转换，或是，毁坏）（MUNGe ,"MASH UNTIL NO GOOD）出我的html文档然后单独留下文档？或是如何在我的html文档中一次的增加（移除，引用）标签的属性?本章将会解决所有的这些问题。

Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the doc strings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how any of these class methods ever get called. Don't worry, all will be revealed in due time.

这个完整的功能齐全的Python程序包含两个部分。第一个部分，BaseHTMLProcessor.py，是一个通用的帮助你解决处理HTML文件的工具，它通过遍历文件中的标签和文本块实现。第二部分，dialect.py，举例展示了如何使用BaseHTMLProcessor.py来将html文件转换出文本文件，但是仅留下标签。请阅读doc string和注释来整体的掌握程序的功能。其中很大的一部分看起来都像黑色魔术，因为它们没有很明显的说明这些类方法是如何被调用。不用担心，在合适的时候，所有的东西都会显现出来。

Example 8.1. BaseHTMLProcessor.py

例8. 1 BaseHTMLProcessor.py

If you have not already done so, you can used in this book.

如果你现在都还没有下载，你可以以去 下载本书上用到的代码。

from sgmllib import SGMLParser
import htmlentitydefs
class BaseHTMLProcessor(SGMLParser):
def reset(self):
# extend (called by SGMLParser.__init__)
self.pieces = []
SGMLParser.reset(self)
def unknown_starttag(self, tag, attrs):
# called for each start tag
# attrs is a list of (attr, value) tuples
# e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
# Ideally we would like to reconstruct original tag and attributes, but
# we may end up quoting attribute values that weren't quoted in the source
# document, or we may change the type of quotes around the attribute value
# (single to double quotes).
# Note that improperly embedded non-HTML code (like client-side Javascript)
# may be parsed incorrectly by the ancestor, causing runtime script errors.
# All non-HTML code must be enclosed in HTML comment tags ()
# to ensure that it will pass through this parser unaltered (in handle_comment).
strattrs = "".join([' %s="%s"

Example 8.2. dialect.py

例8.2dialect.py

import re
from BaseHTMLProcessor import BaseHTMLProcessor
class Dialectizer(BaseHTMLProcessor):
subs = ()
def reset(self):
# extend (called from __init__ in ancestor)
# Reset all data attributes
self.verbatim = 0
BaseHTMLProcessor.reset(self)
def start_pre(self, attrs):
# called for every <pre> tag in HTML source
# Increment verbatim mode count, then handle tag like normal
self.verbatim += 1
self.unknown_starttag("pre", attrs)
def end_pre(self):
# called for every </pre> tag in HTML source
# Decrement verbatim mode count
self.unknown_endtag("pre")
self.verbatim -= 1
def handle_data(self, text):
# override
# called for every block of text in HTML source
# If in verbatim mode, save text unaltered;
# otherwise process the text with a series of substitutions
self.pieces.append(self.verbatim and text or self.process(text))
def process(self, text):
# called from handle_data
# Process text block by performing series of regular expression
# substitutions (actual substitions are defined in descendant)
for fromPattern, toPattern in self.subs:
text = re.sub(fromPattern, toPattern, text)
return text
class ChefDialectizer(Dialectizer):
"""convert HTML to Swedish Chef-speak
"""
subs = ((r'a([nu])', r'u\1'),
(r'A([nu])', r'U\1'),
(r'a\B', r'e'),
(r'A\B', r'E'),
(r'en\b', r'ee'),
(r'\Bew', r'oo'),
(r'\Be\b', r'e-a'),
(r'\be', r'i'),
(r'\bE', r'I'),
(r'\Bf', r'ff'),
(r'\Bir', r'ur'),
(r'(\w*?)i(\w*?)$', r'\1ee\2'),
(r'\bow', r'oo'),
(r'\bo', r'oo'),
(r'\bO', r'Oo'),
(r'the', r'zee'),
(r'The', r'Zee'),
(r'th\b', r't'),
(r'\Btion', r'shun'),
(r'\Bu', r'oo'),
(r'\BU', r'Oo'),
(r'v', r'f'),
(r'V', r'F'),
(r'w', r'w'),
(r'W', r'W'),
(r'([a-z])[.]', r'\1. Bork Bork Bork!'))
class FuddDialectizer(Dialectizer):
"""convert HTML to Elmer Fudd-speak"""
subs = ((r'[rl]', r'w'),
(r'qu', r'qw'),
(r'th\b', r'f'),
(r'th', r'd'),
(r'n[.]', r'n, uh-hah-hah-hah.'))
class OldeDialectizer(Dialectizer):
"""convert HTML to mock Middle English"""
subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'),
(r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'),
(r'ick\b', r'yk'),
(r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
(r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
(r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'),
(r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'),
(r'([aeiou])re\b', r'\1r'),
(r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'),
(r'tion\b', r'cioun'),
(r'ion\b', r'ioun'),
(r'aid', r'ayde'),
(r'ai', r'ey'),
(r'ay\b', r'y'),
(r'ay', r'ey'),
(r'ant', r'aunt'),
(r'ea', r'ee'),
(r'oa', r'oo'),
(r'ue', r'e'),
(r'oe', r'o'),
(r'ou', r'ow'),
(r'ow', r'ou'),
(r'\bhe', r'hi'),
(r've\b', r'veth'),
(r'se\b', r'e'),
(r"'s\b", r'es'),
(r'ic\b', r'ick'),
(r'ics\b', r'icc'),
(r'ical\b', r'ick'),
(r'tle\b', r'til'),
(r'll\b', r'l'),
(r'ould\b', r'olde'),
(r'own\b', r'oune'),
(r'un\b', r'onne'),
(r'rry\b', r'rye'),
(r'est\b', r'este'),
(r'pt\b', r'pte'),
(r'th\b', r'the'),
(r'ch\b', r'che'),
(r'ss\b', r'sse'),
(r'([wybdp])\b', r'\1e'),
(r'([rnt])\b', r'\1\1e'),
(r'from', r'fro'),
(r'when', r'whan'))
def translate(url, dialectName="chef"):
"""fetch URL and translate using dialect
dialect in ("chef", "fudd", "olde")"""
import urllib
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()
parserName = "%sDialectizer" % dialectName.capitalize()
parserClass = globals()[parserName]
parser = parserClass()
parser.feed(htmlSource)
parser.close()
return parser.output()
def test(url):
"""test all dialects against URL"""
for dialect in ("chef", "fudd", "olde"):
outfile = "%s.html" % dialect
fsock = open(outfile, "wb")
fsock.write(translate(url, dialect))
fsock.close()
import webbrowser
webbrowser.open_new(outfile)
if __name__ == "__main__":
test("")

Example 8.3. Output of dialect.py

例8.3 dialect.py的输出

Running this script will translate Section 3.2, “Introducing Lists” into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you'll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the code listings and screen examples were left untouched.

运行这个脚本将会转换 Section 3.2, “Introducing Lists”为mock Swedish Chef-speak, mock Elmer Fudd-speak(来源于Bugs,Bunny,cartoons)以及 mock Middle English（或许基于Chaucer's The Canterbury Tales)）.如果你自己查看输出内容的html源代码，你会发现所有的html标签和属性都没有遭到破坏，但是标签内的文本都被转换成mock 语言。事实上如果你仔细看，只有文件的标题段落被转换，下面被展示的代码和例子没有被转换。（这段很晦涩，等全部阅读完这章，在重新修订一下）

<div class="abstract">
<p>Lists awe <span class="application">Pydon</span>

阅读(1291) | 评论(0) | 转发(0) |

上一篇：journal of python(46)　－－dive into python 　

下一篇：journal of python(48)　－－dive into python 　

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6