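The following is excerpted from
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html
Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and produces a parse tree. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save you a great deal of programming time. For Ruby, use Rubyful Soup.
1. Parsing HTML
The code below demonstrates the basic features of Beautiful Soup. You can copy and paste it into a Python file and run it yourself.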
from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
Some ways to navigate the soup:
soup.contents[0].name
# u'html'

soup.contents[0].contents[0].name
# u'head'

head = soup.contents[0].contents[0]
head.parent.name
# u'html'

head.next
# <title>Page title</title>

head.nextSibling.name
# u'body'

head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
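The BeautifulSoup 3 module used above runs only on Python 2. As a rough sketch for Python 3 readers, the same walk can be written against Beautiful Soup 4, assuming the `bs4` package is installed. The `html.parser` builder and the explicitly closed `<p>` tags are choices made for this sketch, not part of the original example; `next_element` and `next_sibling` are the bs4 spellings of `next` and `nextSibling`.

```python
from bs4 import BeautifulSoup

# Same document as above, but with every <p> closed explicitly so the
# resulting tree does not depend on how the parser repairs markup.
markup = ('<html><head><title>Page title</title></head><body>'
          '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
          '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
          '</body></html>')
soup = BeautifulSoup(markup, 'html.parser')

head = soup.contents[0].contents[0]
print(soup.contents[0].name)   # html
print(head.name)               # head
print(head.parent.name)        # html
print(head.next_element)       # <title>Page title</title>

body = head.next_sibling
first_para = body.contents[0]
second_para = first_para.next_sibling
print(body.name)               # body
print(first_para['id'])        # firstpara
print(second_para['id'])       # secondpara
```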
The text argument of findAll is used to search for NavigableString objects. Its value can be a string, a regular expression, a list or dictionary, True or None, or a callable that takes a NavigableString as its argument. If you use text, any values given for name and for the keyword arguments are ignored.
soup.findAll(text="one")
# [u'one']
soup.findAll(text=u'one')
# [u'one']

soup.findAll(text=["one", "two"])
# [u'one', u'two']

soup.findAll(text=re.compile("paragraph"))
# [u'This is paragraph ', u'This is paragraph ']

soup.findAll(text=True)
# [u'Page title', u'This is paragraph ', u'one', u'.', u'This is paragraph ',
#  u'two', u'.']

soup.findAll(text=lambda x: len(x) < 12)
# [u'Page title', u'one', u'.', u'two', u'.']
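To make the matching rules concrete, here is a small standalone sketch of how one criterion could be tested against one string. It is not Beautiful Soup's actual implementation: the helper names `text_matches` and `find_text` are hypothetical, the None and dictionary cases are left out, and the list of strings is simply copied from the `text=True` result above.

```python
import re

def text_matches(text, criterion):
    """Decide whether one string satisfies one text criterion (sketch only)."""
    if criterion is True:
        return True                                # True matches every string
    if callable(criterion):
        return bool(criterion(text))               # predicate decides
    if hasattr(criterion, 'search'):
        return criterion.search(text) is not None  # compiled regular expression
    if isinstance(criterion, (list, tuple, set)):
        return text in criterion                   # any one of several strings
    return text == criterion                       # plain string: exact match

# The strings of the example document, in document order.
strings = ['Page title', 'This is paragraph ', 'one', '.',
           'This is paragraph ', 'two', '.']

def find_text(criterion):
    """Return every string in the document that satisfies the criterion."""
    return [s for s in strings if text_matches(s, criterion)]

print(find_text('one'))                      # ['one']
print(find_text(['one', 'two']))             # ['one', 'two']
print(find_text(re.compile('paragraph')))    # ['This is paragraph ', 'This is paragraph ']
print(find_text(lambda s: len(s) < 12))      # ['Page title', 'one', '.', 'two', '.']
```

The regular expression is applied with search rather than match, which is consistent with the output above: "paragraph" is found even though it is not at the start of the string.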
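The two functions below return, respectively, all the text contained in an element (including its descendants) and the text of every sibling that follows an element.
from BeautifulSoup import NavigableString

def get_all_text_from_soup(item):
    '''Return all the text contained in a soup item.'''
    if isinstance(item, NavigableString):
        # A bare string node: its text is the node itself.
        output = item.string
    else:
        # A tag: join every string found anywhere beneath it.
        output = u''.join(item.findAll(text=True))
    return output


def get_all_text_next_soup(item):
    '''Return the concatenated text of every sibling that follows item.'''
    output = u''
    brother = item.nextSibling
    # nextSibling is None once the last sibling has been passed.
    while brother is not None:
        output += get_all_text_from_soup(brother)
        brother = brother.nextSibling
    return output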
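For comparison, here is a sketch of the same two helpers for Beautiful Soup 4 on Python 3, assuming the `bs4` package is installed. The names `text_of` and `text_of_following_siblings` are hypothetical, and the markup is a shortened, well-formed version of the earlier example. In bs4, `get_text()` and the `next_siblings` generator do most of the work.

```python
from bs4 import BeautifulSoup, NavigableString

markup = ('<html><head><title>Page title</title></head><body>'
          '<p id="firstpara">This is paragraph <b>one</b>.</p>'
          '<p id="secondpara">This is paragraph <b>two</b>.</p>'
          '</body></html>')
soup = BeautifulSoup(markup, 'html.parser')

def text_of(node):
    """Return all text inside a tag, or the string itself for a text node."""
    if isinstance(node, NavigableString):
        return str(node)
    return node.get_text()

def text_of_following_siblings(node):
    """Return the concatenated text of every sibling after node."""
    return ''.join(text_of(sibling) for sibling in node.next_siblings)

first = soup.find('p', id='firstpara')
print(text_of(first))                       # This is paragraph one.
print(text_of_following_siblings(first))    # This is paragraph two.
print(text_of_following_siblings(first.b))  # .
```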
2. Generating HTML
from BeautifulSoup import BeautifulSoup, Tag

soup = BeautifulSoup()
mem_attr = ['Description', 'PhysicalID', 'Slot', 'Size', 'Width']

# Build the skeleton: <html><table><tr>...</tr></table></html>
html = Tag(soup, "html")
table = Tag(soup, "table")
tr = Tag(soup, "tr")
soup.append(html)
html.append(table)
table.append(tr)

# Add one <th> header cell per attribute name.
for attr in mem_attr:
    th = Tag(soup, "th")
    tr.append(th)
    th.append(attr)

print soup.prettify()
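In Beautiful Soup 4 tags are created with `soup.new_tag()` rather than by calling `Tag(soup, name)` directly. The following is a sketch of the same table header for Python 3, assuming the `bs4` package is installed; starting from an empty document with the `html.parser` builder is a choice made for this sketch.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('', 'html.parser')   # start from an empty document
mem_attr = ['Description', 'PhysicalID', 'Slot', 'Size', 'Width']

# Build the skeleton: <html><table><tr>...</tr></table></html>
html = soup.new_tag('html')
table = soup.new_tag('table')
tr = soup.new_tag('tr')
soup.append(html)
html.append(table)
table.append(tr)

# Add one <th> header cell per attribute name.
for attr in mem_attr:
    th = soup.new_tag('th')
    th.string = attr
    tr.append(th)

print(soup.prettify())
```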
Another way to generate HTML is pyh, a lightweight and convenient option. Surprisingly, the whole module is only 145 lines long:
# wc -l /usr/local/lib/python2.6/dist-packages/pyh.py
145 /usr/local/lib/python2.6/dist-packages/pyh.py
from pyh import *

page = PyH('My wonderful PyH page')
page.addCSS('myStylesheet1.css', 'myStylesheet2.css')
page.addJS('myJavascript1.js', 'myJavascript2.js')
page << h1('My big title', cl='center')
page << div(cl='myCSSclass1 myCSSclass2', id='myDiv1') << p('I love PyH!', id='myP1')
mydiv2 = page << div(id='myDiv2')
mydiv2 << h2('A smaller title') + p('Followed by a paragraph.')
page << div(id='myDiv3')
page.myDiv3.attributes['cl'] = 'myCSSclass3'
page.myDiv3 << p('Another paragraph')
page.printOut()
This produces the following output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html lang="en" xmlns="">
<head>
<title>My wonderful PyH page</title>
<link href="myStylesheet1.css" type="text/css" rel="stylesheet" />
<link href="myStylesheet2.css" type="text/css" rel="stylesheet" />
<script src="myJavascript1.js" type="text/javascript"></script>
<script src="myJavascript2.js" type="text/javascript"></script>
</head>
<body>
<h1 class="center">My big title</h1>
<div id="myDiv1" class="myCSSclass1 myCSSclass2">
<p id="myP1">I love PyH!</p>
</div>
<div id="myDiv2">
<h2>A smaller title</h2>
<p>Followed by a paragraph.</p>
</div>
<div id="myDiv3" class="myCSSclass3">
<p>Another paragraph</p>
</div>
</body>
</html>
The project is hosted at: