Category: Python/Ruby
2013-07-24 12:47:24
Original post: Beautiful Soup Documentation 1: Quick Start. Author: oychw
*Importing the Beautiful Soup library in a program:
from BeautifulSoup import BeautifulSoup # For processing HTML
from BeautifulSoup import BeautifulStoneSoup # For processing XML
import BeautifulSoup # To get everything
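The imports above are for Beautiful Soup 3 on Python 2. As a rough sketch of the equivalent import in the later bs4 package, assuming Python 3 with bs4 installed (neither is used in the original post), something like the following should work. The sample HTML string is only an illustration.

```python
# Assumption: Python 3 with the bs4 package installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

# A small illustrative document (not from the original post).
html = "<html><head><title>Page title</title></head><body></body></html>"

# bs4 expects the parser to be named; "html.parser" ships with Python.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string
print(title_text)
```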
*A demonstration of Beautiful Soup's basic features:
#!/usr/bin/env python
# -*- coding: gbk -*-
#gtalk: ouyangchongwu#gmail.com
#python qq group: Shenzhen automated testing python 113938272
import sys
# Set the default string encoding to GBK
reload(sys)
sys.setdefaultencoding('gbk')
from BeautifulSoup import BeautifulSoup
import re
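doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']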
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
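Output:
<html>
 <head>
  <title>
   Page title
  </title>
 </head>
 <body>
  <p id="firstpara" align="center">
   This is paragraph
   <b>
    one
   </b>
   .
  </p>
  <p id="secondpara" align="blah">
   This is paragraph
   <b>
    two
   </b>
   .
  </p>
 </body>
</html>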
*Some ways to navigate the soup:
Get the name of the first tag:
>>> soup.contents[0].name
u'html'
Get the name of the first tag's first child:
>>> soup.contents[0].contents[0].name
u'head'
Get the name of the parent tag:
>>> head = soup.contents[0].contents[0]
>>> head.parent.name
u'html'
The next element. Note that head is still <head><title>Page title</title></head> here:
>>> head.next
<title>Page title</title>
The name of the next sibling tag:
>>> head.nextSibling.name
u'body'
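The first child of the next sibling tag:
>>> head.nextSibling.contents[0]
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
The next sibling of that first child:
>>> head.nextSibling.contents[0].nextSibling
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>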
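The navigation steps above can be sketched for the later bs4 package as well. This assumes Python 3 with bs4 installed, which the original post does not use. In bs4 the attributes are spelled next_element and next_sibling instead of next and nextSibling. The closing </p> and </body> tags are written out here so the result does not depend on how the parser repairs unclosed tags.

```python
# Assumption: Python 3 with bs4 installed; the post itself uses Beautiful Soup 3.
from bs4 import BeautifulSoup

doc = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
       '</body></html>')
soup = BeautifulSoup(doc, "html.parser")

html_tag = soup.contents[0]      # the top-level <html> tag
head = html_tag.contents[0]      # its first child, <head>

names = {
    "first": html_tag.name,
    "child": head.name,
    "parent": head.parent.name,
    "next_element": head.next_element.name,   # first thing inside <head>
    "next_sibling": head.next_sibling.name,   # the tag after <head>
}

first_p = head.next_sibling.contents[0]   # first child of <body>
second_p = first_p.next_sibling           # the paragraph after it

print(names)
print(first_p["id"], second_p["id"])
```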
*Searching the soup for a given tag, or for tags with given attributes:
>>> titleTag = soup.html.head.title
>>> titleTag
<title>Page title</title>
>>> titleTag.string
u'Page title'
>>> len(soup('p'))
2
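This means there are two <p> tags.
>>> soup.findAll('p', align="center")
[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
>>> soup.find('p', align="center")
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>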
>>> soup('p', align="center")[0]['id']
u'firstpara'
>>> soup.find('p', align=re.compile('^b.*'))['id']
u'secondpara'
>>> soup.find('p').b.string
u'one'
>>> soup('p')[1].b.string
u'two'
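For comparison, here is a minimal sketch of the same searches using the later bs4 package, assuming Python 3 with bs4 installed (not what the post uses). In bs4, findAll is spelled find_all. Calling the soup directly, as in soup('p'), remains a shorthand for it.

```python
# Assumption: Python 3 with bs4 installed; the post itself uses Beautiful Soup 3.
import re
from bs4 import BeautifulSoup

doc = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
       '</body></html>')
soup = BeautifulSoup(doc, "html.parser")

# Every <p> tag in the document.
paragraphs = soup.find_all("p")

# Only <p> tags whose align attribute is exactly "center".
centered = soup.find_all("p", align="center")

# First <p> whose align attribute matches a regular expression.
b_aligned = soup.find("p", align=re.compile("^b"))

print(len(paragraphs), centered[0]["id"], b_aligned["id"])
```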
*Modifying the soup:
Add an id to the title:
>>> titleTag['id'] = 'theTitle'
Change the title text:
>>> titleTag.contents[0].replaceWith("New title")
>>> soup.html.head
<head><title id="theTitle">New title</title></head>
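Remove the first <p> tag:
>>> soup.p.extract()
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
>>> print soup.prettify()
<html>
 <head>
  <title id="theTitle">
   New title
  </title>
 </head>
 <body>
  <p id="secondpara" align="blah">
   This is paragraph
   <b>
    two
   </b>
   .
  </p>
 </body>
</html>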
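Replace one tag with another:
>>> soup.p.replaceWith(soup.b)
>>> print soup.prettify()
<html>
 <head>
  <title id="theTitle">
   New title
  </title>
 </head>
 <body>
  <b>
   two
  </b>
 </body>
</html>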
Insert strings into the body:
>>> soup.body.insert(0, "This page used to have ")
>>> soup.body.insert(2, " &lt;p&gt; tags!")
>>> soup.body
<body>This page used to have <b>two</b> &lt;p&gt; tags!</body>
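The modification steps above, run from start to finish, might look like this in the later bs4 package. This is a sketch assuming Python 3 with bs4 installed, not the Beautiful Soup 3 API used in the post. In bs4, replaceWith is spelled replace_with.

```python
# Assumption: Python 3 with bs4 installed; the post itself uses Beautiful Soup 3.
from bs4 import BeautifulSoup

doc = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
       '</body></html>')
soup = BeautifulSoup(doc, "html.parser")

# Add an attribute, then replace the title's text node.
title_tag = soup.title
title_tag["id"] = "theTitle"
title_tag.string.replace_with("New title")

# Detach the first <p>; extract() returns the removed tag.
removed = soup.p.extract()

# Replace the remaining <p> with the <b> tag it contains.
soup.p.replace_with(soup.b)

# Insert plain strings before and after the <b> tag.
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " <p> tags!")

body_html = str(soup.body)
print(body_html)
```

Note that bs4 escapes the angle brackets of inserted text when it renders the tree, so the string " <p> tags!" can be passed as-is.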
*Practical example: extracting all the links from a web page:
#!/usr/bin/env python
# -*- coding: gbk -*-
#gtalk: ouyangchongwu#gmail.com
#python qq group: Shenzhen automated testing python 113938272
import sys
# Set the default string encoding to GBK
reload(sys)
sys.setdefaultencoding('gbk')
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://blog.chinaunix.net/u/21908/")
soup = BeautifulSoup(page)
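# href=True skips <a> tags that have no href attribute,
# which would otherwise raise a KeyError below.
for incident in soup('a', href=True):
    print incident['href']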
The same can be done with regular expressions, but with BeautifulSoup you do not have to construct the matching patterns yourself.
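To illustrate that trade-off, here is a small sketch that extracts links from an HTML string with both approaches. It assumes Python 3 with the later bs4 package installed, not the Beautiful Soup 3 setup used above. The helper extract_links, the sample markup and the file names a.html and b.html are hypothetical; only the base URL comes from the post. No network request is made.

```python
# Assumption: Python 3 with bs4 installed; extract_links and the sample
# markup are hypothetical, for illustration only.
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


sample = '''<p><a href="/u/21908/a.html">first</a>
<a name="anchor-only">no href</a>
<A HREF='b.html' class="x">second</A></p>'''

links = extract_links(sample, "http://blog.chinaunix.net/u/21908/")

# A naive pattern only matches one exact spelling of the tag, so it misses
# the upper-case, single-quoted link that the parser handles.
regex_links = re.findall(r'<a href="([^"]*)"', sample)

print(links)
print(regex_links)
```

A more tolerant regular expression could be written, but each variation in case, quoting or attribute order has to be anticipated by hand, which is the work the parser takes over.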