Category: Python/Ruby
2013-07-06 14:34:14
Tags: beautifulsoup, miscellany | Category: Study notes
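Installing Beautiful Soup 4 and related issues
The latest Beautiful Soup release at the time of writing is 4.1.1, which can be downloaded here ()
Documentation:
()
Usage: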
from bs4 import BeautifulSoup
Example:
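HTML document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Code: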
from bs4 import BeautifulSoup
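soup = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly so results do not depend on which parsers are installed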
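From here you can start using the various features.
soup.X (X is any tag name; returns the first matching tag in full, including its attributes and contents)
e.g. soup.title
# <title>The Dormouse's story</title>
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.a (note: returns only the first result)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a') (find_all returns every match)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
find can also search by attribute
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>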
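The lookups above can be combined in one short script. This is only a minimal sketch: the markup is a hypothetical cut-down version of the three-sisters story, and Python's built-in "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the three-sisters story.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                     # first <a> only, like soup.find("a")
all_links = soup.find_all("a")          # every <a>, in document order
by_id = soup.find(id="link3")           # keyword arguments filter on attributes
by_class = soup.find_all("a", attrs={"class": "sister"})
missing = soup.find(id="no-such-id")    # find gives None when nothing matches

link_names = [a.get_text() for a in all_links]
print(link_names)
```

Note that find returns None rather than raising when nothing matches, so check the result before reading attributes from it.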
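To read a particular attribute from tags, combine find_all with get
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie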
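When some anchors have no href, the loop above prints None for them. A possible variation, sketched here with hypothetical markup, is to filter on the attribute's presence with href=True, or to pass a default to get:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor deliberately has no href.
html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
links = {a.get_text(): a["href"] for a in soup.find_all("a", href=True)}

# get() returns None, or a supplied default, when the attribute is absent.
second = soup.find(id="link2")
fallback = second.get("href", "#")

print(links)
print(fallback)
```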
To extract all of the text in the HTML document, use get_text()
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
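As the output above shows, get_text() keeps the whitespace of the source markup. If cleaner text is wanted, one option is to pass a separator and strip=True, or to iterate over stripped_strings. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with stray whitespace around the names.
html = "<div><p> Elsie </p><p>Lacie</p><p> Tillie</p></div>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.get_text()                     # text nodes concatenated as-is
joined = soup.get_text("|", strip=True)   # strip each piece, join with "|"
names = list(soup.stripped_strings)       # the same pieces as a list

print(repr(raw))
print(joined)
print(names)
```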
To parse an HTML file from disk, you can write:
soup = BeautifulSoup(open("index.html"))
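The one-liner above leaves the file handle open. A slightly more careful sketch uses a with block so the file is closed after parsing. The file written here is a throwaway created only to keep the example self-contained, and the UTF-8 encoding and "html.parser" are assumptions:

```python
import os
import tempfile

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical index.html, created only for this demonstration.
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("<html><head><title>Demo page</title></head></html>")

    # BeautifulSoup reads the whole file while constructing the soup,
    # so the handle can be closed straight afterwards.
    with open(path, encoding="utf-8") as fh:
        soup = BeautifulSoup(fh, "html.parser")

page_title = soup.title.string
print(page_title)
```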
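Objects in BeautifulSoup
tag (corresponds to a tag in the HTML)
tag.attrs (returns all of the tag's attributes as a dictionary)
You can add, remove, and modify a tag's attributes directly, just as you would with a dictionary
# assumed starting point: the original post does not show how tag was created
tag = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser').b
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>
del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>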
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
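The dictionary-style operations above can be tried end to end in a short script. This is a sketch only: the `<b>` tag and its attribute values are made up for illustration. One detail worth knowing is that class is treated as a multi-valued attribute, so it is returned as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical tag used only for this demonstration.
soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

attrs_before = dict(tag.attrs)      # copy of the attribute dictionary

tag["id"] = "headline"              # modify an existing attribute
tag["lang"] = "en"                  # add a new one
del tag["class"]                    # remove one

has_class = tag.has_attr("class")   # test without risking a KeyError
safe = tag.get("class")             # None instead of KeyError
rendered = str(tag)

print(attrs_before)
print(rendered)
```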
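X.contents (X is a tag; returns the tag's children as a list)
e.g.
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u"The Dormouse's story"]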
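A tag's children can be either other tags or plain strings. The following sketch, with hypothetical markup, walks the children of a paragraph and also shows .string, which only yields a value when a tag has exactly one child:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the <p> mixes text and a nested <b> tag.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>One <b>two</b> three</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

head_children = soup.head.contents   # list holding the single <title> tag
title_text = soup.title.string       # the only child is a string, so this works

p = soup.p
# Text nodes are NavigableString objects; nested elements are Tag objects.
kinds = ["text" if isinstance(c, NavigableString) else c.name for c in p.children]
p_string = p.string                  # None: <p> has more than one child

print(kinds)
```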
Fixing garbled characters when parsing a web page:
import urllib2
from