chinaboy小宝chinaboy007.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

chinaboywg

博客访问： 2907468
博文数量： 348
博客积分： 2907
博客等级：中校
技术积分： 2272
用户组：普通用户
注册时间： 2010-03-12 09:16

个人简介

专注 K8S研究

文章分类

全部博文（348）

elk（2）
docker（5）
error（0）
zabbix（21）
haproxy（2）
linux（11）
redis（2）
lvs（9）
squid（8）
nagios（4）
puppet（6）
html（1）
nginx（45）
apache（3）
mysql（65）
php（0）
python（114）

pycharm（1）

pip（1）

requests（1）

requests（0）

urllib（0）

logging（1）

flask（0）

lib（0）

pyqt4（14）

django（7）

beautifulsoup（11）

scrapy（3）

string（6）

pexpect（4）
shell（19）
linux（25）
other（4）
未分配的博文（2）

文章存档

2019年（22）

2018年（57）

2016年（2）

2015年（27）

2014年（33）

2013年（190）

2011年（3）

2010年（14）

我的朋友

1. 解析html

下面的代码是Beautiful Soup基本功能的示范。你可以复制粘贴到你的python文件中，自己运行看看。

from BeautifulSoup import BeautifulSoup
import re
doc = ['Page title',
'
This is paragraph one.',
'
This is paragraph two.',
'']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
# <head>
# <title>
# Page title
# </title>
# </head>
# <body>
# <p id="firstpara" align="center">
# This is paragraph
# <b>
# one
# </b>
# .
# </p>
# <p id="secondpara" align="blah">
# This is paragraph
# <b>
# two
# </b>
# .
# </p>
# </body>
# </html>

navigate soup的一些方法:

soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

findAll方法中的text 是一个用于搜索NavigableString对象的参数。它的值可以是字符串，一个正则表达式，一个list或dictionary，True或None，一个以NavigableString为参数的可调用对象,如果你使用text，任何指定给name 以及keyword参数的值都会被忽略。

soup.findAll(text="one")
# [u'one']
soup.findAll(text=u'one')
# [u'one']
soup.findAll(text=["one", "two"])
# [u'one', u'two']
soup.findAll(text=re.compile("paragraph"))
# [u'This is paragraph ', u'This is paragraph ']
soup.findAll(text=True)
# [u'Page title', u'This is paragraph ', u'one', u'.', u'This is paragraph ',
# u'two', u'.']
soup.findAll(text=lambda(x): len(x) < 12)
# [u'Page title', u'one', u'.', u'two', u'.']

下面的两个函数分别是获得html某元素子元素的所有文本内容，以及获得元素后续所有兄弟元素的文本内容

def get_all_text_from_soup(item):
'''item is a soup item, this sub is to find all text which is in this item'''
if (item.__class__.__name__ == 'NavigableString'):
output = item.string;
else:
output = u''.join(item.findAll(text=True));
return output;
def get_all_text_next_soup(item):
output = u'';
while(True):
brother = item.nextSibling;
if brother:
output = output + get_all_text_from_soup(brother);
item = brother;
else:
break;
return output;

2. 生成html

from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup()
mem_attr = ['Description', 'PhysicalID', 'Slot', 'Size', 'Width']
html = Tag(soup, "html")
table = Tag(soup, "table")
tr = Tag(soup, "tr")
soup.append(html)
html.append(table)
table.append(tr)
for attr in mem_attr:
th = Tag(soup, "th")
tr.append(th)
th.append(attr)
print soup.prettify()

另一种生成html的方法是利用pyh，这是一个很轻巧方便的途径，令人吃惊的是这个文件只有145行
# wc -l /usr/local/lib/python2.6/dist-packages/pyh.py
145 /usr/local/lib/python2.6/dist-packages/pyh.py

from pyh import *
page = PyH('My wonderful PyH page')
page.addCSS('myStylesheet1.css', 'myStylesheet2.css')
page.addJS('myJavascript1.js', 'myJavascript2.js')
page << h1('My big title', cl='center')
page << div(cl='myCSSclass1 myCSSclass2', id='myDiv1') << p('I love PyH!', id='myP1')
mydiv2 = page << div(id='myDiv2')
mydiv2 << h2('A smaller title') + p('Followed by a paragraph.')
page << div(id='myDiv3')
page.myDiv3.attributes['cl'] = 'myCSSclass3'
page.myDiv3 << p('Another paragraph')
page.printOut()

会得到如下输出

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html lang="en" xmlns="">
<head>
<title>My wonderful PyH page</title>
<link href="myStylesheet1.css" type="text/css" rel="stylesheet" />
<link href="myStylesheet2.css" type="text/css" rel="stylesheet" />
<script src="myJavascript1.js" type="text/javascript"></script>
<script src="myJavascript2.js" type="text/javascript"></script>
</head>
<body>
<h1 class="center">My big title</h1>
<div id="myDiv1" class="myCSSclass1 myCSSclass2">
<p id="myP1">I love PyH!>
</div>
<div id="myDiv2">
<h2>A smaller title</h2>
<p>Followed by a paragraph.</p>
</div>
<div id="myDiv3" class="myCSSclass3">
<p>Another paragraph</p>
</div>
</body>
</html>

项目地址为：

阅读(11214) | 评论(0) | 转发(0) |

上一篇：Python语言十分钟快速入门

下一篇：Windows下安装Python SSH模块及其使用

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6