犹大huaius.blog.chinaunix.net

1. Parsing HTML

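The code below demonstrates the basic features of Beautiful Soup. You can copy and paste it into a Python file and run it yourself. The examples use the Beautiful Soup 3 API (`from BeautifulSoup import ...`) under Python 2.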

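from BeautifulSoup import BeautifulSoup
import re

# The sample document, split across several strings and joined before parsing.
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()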
# <html>
# <head>
# <title>
# Page title
# </title>
# </head>
# <body>
# <p id="firstpara" align="center">
# This is paragraph
# <b>
# one
# </b>
# .
# </p>
# <p id="secondpara" align="blah">
# This is paragraph
# <b>
# two
# </b>
# .
# </p>
# </body>
# </html>

Some ways to navigate the soup:

soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

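The text argument of findAll searches for NavigableString objects. Its value can be a string, a regular expression, a list or dictionary, True or None, or a callable that takes a NavigableString as its argument. If you use text, any values given for name and for keyword arguments are ignored.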

soup.findAll(text="one")
# [u'one']
soup.findAll(text=u'one')
# [u'one']
soup.findAll(text=["one", "two"])
# [u'one', u'two']
soup.findAll(text=re.compile("paragraph"))
# [u'This is paragraph ', u'This is paragraph ']
soup.findAll(text=True)
# [u'Page title', u'This is paragraph ', u'one', u'.', u'This is paragraph ',
# u'two', u'.']
soup.findAll(text=lambda x: len(x) < 12)
# [u'Page title', u'one', u'.', u'two', u'.']
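
The matching rules for text can be sketched in plain Python. This is only an illustration of the semantics described above, not Beautiful Soup's actual implementation: the names `matches` and `find_text` are hypothetical, the strings are copied by hand from the sample document, and the None and dictionary cases are left out.

```python
import re

def matches(text, criterion):
    """Return True if text satisfies criterion (hypothetical helper)."""
    if criterion is True:
        return True                                # True matches every string
    if callable(criterion):
        return bool(criterion(text))               # a predicate decides
    if hasattr(criterion, 'search'):
        return criterion.search(text) is not None  # compiled regular expression
    if isinstance(criterion, (list, tuple, set)):
        return text in criterion                   # any one of several strings
    return text == criterion                       # plain string: exact match

def find_text(strings, criterion):
    """Filter strings the way the text argument filters text nodes."""
    return [s for s in strings if matches(s, criterion)]

# The text nodes of the sample document, in document order.
strings = ['Page title', 'This is paragraph ', 'one', '.',
           'This is paragraph ', 'two', '.']

print(find_text(strings, 'one'))
print(find_text(strings, re.compile('paragraph')))
print(find_text(strings, lambda s: len(s) < 12))
```

Note that a plain string must equal the whole text node, while a regular expression only needs to match somewhere inside it.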

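The two functions below return, respectively, all of the text contained in an HTML element (including its descendants) and all of the text in the siblings that follow an element.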

from BeautifulSoup import NavigableString

def get_all_text_from_soup(item):
    '''Return all of the text contained in item, a soup node.'''
    if isinstance(item, NavigableString):
        # A text node is its own text.
        return item.string
    # A tag: join every text node found beneath it.
    return u''.join(item.findAll(text=True))

def get_all_text_next_soup(item):
    '''Return the text of every sibling that follows item.'''
    output = u''
    brother = item.nextSibling
    # Compare against None so that an empty text node does not end the walk early.
    while brother is not None:
        output += get_all_text_from_soup(brother)
        brother = brother.nextSibling
    return output

2. Generating HTML

from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup()
mem_attr = ['Description', 'PhysicalID', 'Slot', 'Size', 'Width']
html = Tag(soup, "html")
table = Tag(soup, "table")
tr = Tag(soup, "tr")
soup.append(html)
html.append(table)
table.append(tr)
for attr in mem_attr:
    # One <th> header cell per attribute name.
    th = Tag(soup, "th")
    tr.append(th)
    th.append(attr)
print soup.prettify()
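
For comparison, the same header row can be built without Beautiful Soup. The sketch below swaps in the standard library's xml.etree.ElementTree and is written for Python 3; it is an alternative technique, not part of the Beautiful Soup API, and it reuses the mem_attr list from above.

```python
import xml.etree.ElementTree as ET

mem_attr = ['Description', 'PhysicalID', 'Slot', 'Size', 'Width']

# Build html > table > tr, mirroring the nesting used above.
html = ET.Element('html')
table = ET.SubElement(html, 'table')
tr = ET.SubElement(table, 'tr')

# One <th> header cell per attribute name.
for attr in mem_attr:
    th = ET.SubElement(tr, 'th')
    th.text = attr

# encoding='unicode' makes tostring return a str instead of bytes.
markup = ET.tostring(html, encoding='unicode')
print(markup)
```

Unlike prettify(), tostring() emits the markup on a single line with no indentation.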

Another way to generate HTML is pyh, a small and convenient module. Surprisingly, the file is only 145 lines long:
# wc -l /usr/local/lib/python2.6/dist-packages/pyh.py
145 /usr/local/lib/python2.6/dist-packages/pyh.py

from pyh import *
page = PyH('My wonderful PyH page')
page.addCSS('myStylesheet1.css', 'myStylesheet2.css')
page.addJS('myJavascript1.js', 'myJavascript2.js')
page << h1('My big title', cl='center')
page << div(cl='myCSSclass1 myCSSclass2', id='myDiv1') << p('I love PyH!', id='myP1')
mydiv2 = page << div(id='myDiv2')
mydiv2 << h2('A smaller title') + p('Followed by a paragraph.')
page << div(id='myDiv3')
page.myDiv3.attributes['cl'] = 'myCSSclass3'
page.myDiv3 << p('Another paragraph')
page.printOut()

This produces the following output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html lang="en" xmlns="">
<head>
<title>My wonderful PyH page</title>
<link href="myStylesheet1.css" type="text/css" rel="stylesheet" />
<link href="myStylesheet2.css" type="text/css" rel="stylesheet" />
<script src="myJavascript1.js" type="text/javascript"></script>
<script src="myJavascript2.js" type="text/javascript"></script>
</head>
<body>
<h1 class="center">My big title</h1>
<div id="myDiv1" class="myCSSclass1 myCSSclass2">
<p id="myP1">I love PyH!</p>
</div>
<div id="myDiv2">
<h2>A smaller title</h2>
<p>Followed by a paragraph.</p>
</div>
<div id="myDiv3" class="myCSSclass3">
<p>Another paragraph</p>
</div>
</body>
</html>
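
A module this small can offer such an API because the `<<` operator is just the `__lshift__` method. The sketch below shows the idea under stated assumptions: the `Element` class and its `render` method are hypothetical and are not PyH's real source. The only behaviors taken from the example above are that `<<` returns the child it appended, so chained calls nest, and that `cl` stands in for `class`.

```python
class Element(object):
    """A tiny HTML element that nests children with << (hypothetical)."""

    def __init__(self, name, *content, **attrs):
        self.name = name
        self.children = list(content)
        # 'class' is a reserved word in Python, so 'cl' stands in for it.
        if 'cl' in attrs:
            attrs['class'] = attrs.pop('cl')
        self.attrs = attrs

    def __lshift__(self, child):
        # Append the child and return it, so a << b << c puts c inside b.
        self.children.append(child)
        return child

    def render(self):
        # Sort attributes for a stable order; no escaping is done here.
        attrs = ''.join(' %s="%s"' % (k, v)
                        for k, v in sorted(self.attrs.items()))
        inner = ''.join(c.render() if isinstance(c, Element) else str(c)
                        for c in self.children)
        return '<%s%s>%s</%s>' % (self.name, attrs, inner, self.name)

body = Element('body')
body << Element('div', id='myDiv1') << Element('p', 'I love PyH!', id='myP1')
print(body.render())
# <body><div id="myDiv1"><p id="myP1">I love PyH!</p></div></body>
```

Returning the child rather than the parent from `__lshift__` is the design choice that makes one-line nesting possible.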

Project URL:
