Parsing a href with BeautifulSoup - chinaboywg - ChinaUnix Blog

Blog: chinaboy小宝 (chinaboy007.blog.chinaunix.net)

Parsing a href with BeautifulSoup

Category: Python/Ruby

2013-07-06 14:34:14

Using BeautifulSoup

(2012-12-19 11:00:01)

Tags: beautifulsoup, miscellaneous

Category: Study notes

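Installing Beautiful Soup 4 and related issues

The latest version of Beautiful Soup at the time of writing is 4.1.1, which can be obtained here ()

Documentation:

()

Usage: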

from bs4 import BeautifulSoup

Example:

HTML document:

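html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""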

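Code:

from bs4 import BeautifulSoup

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

Now the various features can be used.

soup.X (where X is any tag name) returns the first tag with that name as a whole, including its attributes and contents.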

For example:

soup.title

# <title>The Dormouse's story</title>

soup.p

# <p class="title"><b>The Dormouse's story</b></p>

soup.a (note: returns only the first match)

soup.find_all('a') (find_all returns every match)

find can also search by attribute:

soup.find(id="link3")
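The three lookups above can be compared side by side. The following is a minimal sketch using a small stand-in document (the markup and the "no-such-id" value are made up for illustration, not taken from the post); it also shows that find returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# A small stand-in document (hypothetical, not the post's html_doc).
html = """
<html><head><title>Demo page</title></head>
<body>
<p class="story">
<a href="http://example.com/one" id="link1">One</a>
<a href="http://example.com/two" id="link2">Two</a>
<a href="http://example.com/three" id="link3">Three</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                    # attribute access: first <a> only
all_links = soup.find_all("a")         # every <a> in document order
third_link = soup.find(id="link3")     # first tag whose id is "link3"
missing = soup.find(id="no-such-id")   # None when nothing matches

print(soup.title.string)
print(first_link["id"])
print(len(all_links))
print(third_link.get_text())
print(missing)
```

Because find can return None, it is usually worth checking the result before reading attributes from it.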

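To read a particular attribute from tags, combine find_all with get:

for link in soup.find_all('a'):
    print(link.get('href'))

# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie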
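In real pages some a tags have no href, and many hrefs are relative. The sketch below is one possible way to handle both cases; the markup and base_url are hypothetical, and urljoin comes from the Python 3 standard library rather than from BeautifulSoup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: one relative link, one <a> without href, one absolute link.
html = """
<a href="/docs/intro.html">Intro</a>
<a name="anchor-only">No href here</a>
<a href="http://example.com/about">About</a>
"""
base_url = "http://example.com/index.html"  # assumed address of the page

soup = BeautifulSoup(html, "html.parser")

# link.get('href') returns None for tags that lack the attribute.
with_none = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only the <a> tags that actually carry an href.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Resolve relative links against the page address.
absolute = [urljoin(base_url, h) for h in hrefs]

print(with_none)
print(hrefs)
print(absolute)
```

Filtering with href=True avoids None entries in the result, which matters if the links are passed straight to a downloader.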

To extract all the text in the HTML document, use get_text():

print(soup.get_text())

# The Dormouse's story

#

# The Dormouse's story

#

# Once upon a time there were three little sisters; and their names were

# Elsie,

# Lacie and

# Tillie;

# and they lived at the bottom of a well.

#

# ...
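As the output above shows, get_text() keeps the whitespace of the original markup. If cleaner text is wanted, get_text also accepts a separator and a strip flag; the snippet below is a small sketch on made-up markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with stray spaces around the text.
html = "<div><p> First </p><p> Second </p></div>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.get_text()                      # text pieces joined as-is
clean = soup.get_text("\n", strip=True)    # strip each piece, join with newlines
pieces = list(soup.stripped_strings)       # the stripped pieces as a list

print(repr(raw))
print(repr(clean))
print(pieces)
```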

To parse an HTML file from disk, pass an open file object:

with open("index.html") as f:
    soup = BeautifulSoup(f, "html.parser")
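A self-contained way to try file parsing is to write a throwaway file first. This is only a sketch: the file contents are invented, and the explicit encoding="utf-8" is an assumption about how the file was saved.

```python
import os
import tempfile

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")

    # Create a small file so the example does not depend on an existing one.
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>From disk</title></head><body></body></html>")

    # BeautifulSoup reads the whole file while it is still open.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

title = soup.title.string
print(title)
```

Using with closes the file as soon as parsing is done; the soup object stays usable afterwards because the markup has already been read.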

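Objects in BeautifulSoup

Tag (corresponds to a tag in the HTML document)

tag.attrs (returns all of the tag's attributes as a dictionary)

A tag's attributes can be added, removed, and modified directly, just as with a dictionary: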

tag['class'] = 'verybold'

tag['id'] = 1

tag

# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']

del tag['id']

tag

# <blockquote>Extremely bold</blockquote>

tag['class']

# KeyError: 'class'

print(tag.get('class'))

# None
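The post does not show how tag was created, so the sketch below builds one from a one-line snippet (an assumption made for illustration) and walks through the same add, modify, and delete steps.

```python
from bs4 import BeautifulSoup

# Assumed starting markup; the post does not show where `tag` comes from.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

before = dict(tag.attrs)       # snapshot of the attributes before editing

tag["class"] = "verybold"      # modify an existing attribute
tag["id"] = "1"                # add a new attribute
after_set = dict(tag.attrs)

del tag["class"]               # remove attributes again
del tag["id"]

# Indexing a missing attribute raises KeyError; get() returns None instead.
try:
    tag["class"]
    raised = False
except KeyError:
    raised = True

print(before)
print(after_set)
print(tag)
print(raised, tag.get("class"))
```

Note that class is parsed as a list of values, because an HTML element can have several classes.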

X.contents (where X is a tag) returns the tag's children as a list.

For example:

head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>

head_tag.contents

# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# ["The Dormouse's story"]
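The walk from head to title to text can be reproduced with a short snippet. This sketch uses only the head of the example document; the .children iterator is shown alongside .contents as an alternative when a list is not needed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

head_tag = soup.head
title_tag = head_tag.contents[0]     # first child of <head> is the <title> tag
text_node = title_tag.contents[0]    # first child of <title> is its text

# .children iterates over the same direct children without building a list.
child_names = [child.name for child in head_tag.children]

print(title_tag.name)
print(str(text_node))
print(isinstance(text_node, NavigableString))
print(child_names)
```

The innermost item is a NavigableString rather than a Tag, which is why the last list in the example above contains a plain string.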

Fixing garbled characters when parsing a web page:

import urllib2

from
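The original code for this step is cut off above. As a stand-in, here is one possible approach and not necessarily the author's: pass the raw bytes to BeautifulSoup together with the from_encoding argument. The sketch assumes Python 3 (where urllib2 has become urllib.request) and simulates a GBK-encoded response with an in-memory byte string instead of a real download.

```python
from bs4 import BeautifulSoup

# Simulated response body: GBK-encoded bytes, as a server might send them.
page = "<html><head><title>中文标题</title></head><body><p>你好</p></body></html>"
raw_bytes = page.encode("gbk")

# Decoding with the wrong codec is what produces garbled characters.
garbled = raw_bytes.decode("latin-1")

# Telling BeautifulSoup the real encoding lets it decode the bytes correctly.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gbk")

title = soup.title.string
paragraph = soup.p.get_text()

print(title)
print(paragraph)
print(soup.original_encoding)
```

When the encoding is not known in advance, omitting from_encoding lets BeautifulSoup guess it from the bytes, and soup.original_encoding reports which encoding it settled on.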
