Python Plugins: BeautifulSoup - niuniu2006t - ChinaUnix Blog

Python Plugins: BeautifulSoup

Category: LINUX

2010-09-20 10:17:43

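Beautiful Soup is a third-party web page analysis library for Python (it is not part of the standard library). Its name literally means "beautiful soup". It is an HTML/XML parser that can parse web page content quickly.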

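Key features:

Accepts documents with broken markup, builds a parse tree internally, and keeps that tree as close as possible to your original document. This is usually good enough for collecting data.
Provides Python-like commands for searching and editing the tree. It is a toolkit that helps you dissect a document and extract what you need, so you do not have to write a custom parser for every application.
Automatically converts incoming documents to Unicode and outgoing documents to UTF-8. It parses whatever document you give it and does the tree traversal for you. You can tell it "find all the links", or "find all the links whose class is externalLink", or "find all the links whose URL matches the regular expression 'foo.com'", or even "find the table heading that is in bold text, then give me that text".

With Beautiful Soup, work that would otherwise take hours can be done in a few minutes.

Let's look at a few examples.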
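The kinds of requests listed above can be sketched in a few lines. This is only a minimal illustration, assuming the maintained `bs4` package (Beautiful Soup 4) on Python 3 rather than the older `BeautifulSoup` module this post was written for. The markup and the names `html`, `all_links`, `external`, `foo_links`, `bold_headings` and `broken` are invented for the example.

```python
# Assumption: Beautiful Soup 4 (the `bs4` package) on Python 3.
# The markup and all variable names are made up for illustration.
import re
from bs4 import BeautifulSoup

html = """
<table><tr><th><b>Name</b></th><th>Link</th></tr></table>
<p>
  <a href="http://foo.com/news">Foo news</a>
  <a class="externalLink" href="http://example.org/">Example</a>
  <a href="/about">About</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links."
all_links = soup.find_all("a")

# "Find all the links whose class is externalLink."
external = soup.find_all("a", class_="externalLink")

# "Find all the links whose URL matches the regular expression foo.com."
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table headings that are in bold, then give me their text."
bold_headings = [th.b.get_text() for th in soup.find_all("th") if th.b]

# Broken markup: neither <p> nor <b> is closed, yet a tree is still built.
broken = BeautifulSoup("<p>unclosed <b>bold", "html.parser")

print([a["href"] for a in all_links])
print(bold_headings)     # ['Name']
print(broken.b.string)   # bold
```

Each search returns ordinary Python objects (lists of tags, strings), so the results can be filtered further with plain list comprehensions.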

from BeautifulSoup import BeautifulSoup        # for parsing HTML
from BeautifulSoup import BeautifulStoneSoup   # for parsing XML
import BeautifulSoup                           # to get everything

The code below demonstrates the basic usage of Beautiful Soup. You can copy and paste it and run it yourself.
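The snippets in the rest of this post operate on a `soup` object. A minimal sketch of how such an object could be built is given here, assuming the maintained `bs4` package (Beautiful Soup 4) on Python 3 rather than the `BeautifulSoup` 3 module imported earlier. The sample markup is a reconstruction chosen to be consistent with the values the later snippets return (`Page title`, `firstpara`, `secondpara`, `one`, `two`), and the name `doc` is only illustrative.

```python
# Assumption: Beautiful Soup 4 (pip install beautifulsoup4) on Python 3.
from bs4 import BeautifulSoup

# Hypothetical sample document, reconstructed to match the values that the
# snippets in this post return.
doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)

# "html.parser" selects the parser that ships with the Python standard library.
soup = BeautifulSoup(doc, "html.parser")

print(soup.title.string)         # Page title
print(len(soup.find_all("p")))   # 2
```

Under Python 3 the string values come back as plain `str`, so a result shown as `u'html'` in the snippets would print as `'html'`.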

Here is one way to navigate the parsed document:

soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
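head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>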

Next, several ways to search the document for particular tags, or for tags with particular attributes:

import re   # needed for the regular-expression search below

titleTag = soup.html.head.title
titleTag
# <title>Page title</title>
titleTag.string
# u'Page title'
len(soup('p'))
# 2
soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
soup('p', align="center")[0]['id']
# u'firstpara'
soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'
soup.find('p').b.string
# u'one'
soup('p')[1].b.string
# u'two'

Of course, the document can also be modified easily:

titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
# <head><title id="theTitle">New title</title></head>
soup.p.extract()
soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
soup.p.replaceWith(soup.b)
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <b>
#    two
#   </b>
#  </body>
# </html>
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " &lt;p&gt; tags!")
soup.body
# <body>This page used to have <b>two</b> &lt;p&gt; tags!</body>
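For readers on Python 3, here is a rough, self-contained sketch of the same kinds of edits, assuming the maintained `bs4` package (Beautiful Soup 4). It is not the code from this post: the markup is shortened, the names `title_tag` and `removed` are illustrative, and `unwrap()` is swapped in for the `replaceWith` step to lift the `<b>` tag out of its paragraph.

```python
# Assumption: Beautiful Soup 4 (`bs4`) on Python 3; the markup is illustrative.
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara">one</p><p id="secondpara"><b>two</b></p></body></html>',
    "html.parser",
)

# Set an attribute and swap the text inside <title>.
title_tag = soup.title
title_tag["id"] = "theTitle"
title_tag.string.replace_with("New title")

# extract() detaches a tag from the tree and returns it.
removed = soup.p.extract()

# Unwrap the remaining <p>: the tag goes away, its <b> child stays in place.
soup.p.unwrap()

# Plain strings inserted into a tag are escaped when the tree is serialized.
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " <p> tags!")

print(soup.title)   # <title id="theTitle">New title</title>
print(soup.body)    # <body>This page used to have <b>two</b> &lt;p&gt; tags!</body>
```

In this sketch the inserted text is written with a literal `<p>`, and the escaping to `&lt;p&gt;` happens on output.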

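Finally, see the Beautiful Soup documentation for more details. I hope it is helpful.

Original English text: (the translation is abridged; please see the original link)

Source: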


