Getting Started with BeautifulSoup in Python - bjutslg - ChinaUnix Blog

Getting Started with BeautifulSoup in Python

Category: Python/Ruby

2015-10-09 11:23:39

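In an earlier example I used BeautifulSoup to scrape phone-repair shop listings from 58同城 (58.com), and the library really is convenient to use. This post is a more detailed introduction to BeautifulSoup, intended as a beginner's guide. Documentation: /

What is BeautifulSoup?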

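from bs4 import BeautifulSoup

# Sample document; the link targets (href) are left empty here.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="" class="sister" id="link1">Elsie</a>,
<a href="" class="sister" id="link2">Lacie</a> and
<a href="" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title)             # the <title> tag
print(soup.title.name)        # the tag's name
print(soup.title.string)      # the text inside the tag
print(soup.p)                 # the first <p> tag
print(soup.a)                 # the first <a> tag
print(soup.find_all('a'))     # every <a> tag
print(soup.find(id='link3'))  # the tag whose id is "link3"
print(soup.get_text())        # all text in the document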

Output:

<title>The Dormouse's story</title>
title
The Dormouse's story
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="" id="link1">Elsie</a>
[<a class="sister" href="" id="link1">Elsie</a>, <a class="sister" href="" id="link2">Lacie</a>, <a class="sister" href="" id="link3">Tillie</a>]
<a class="sister" href="" id="link3">Tillie</a>

The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

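As you can see, soup is the parsed document that BeautifulSoup builds from the input string. soup.title returns the title tag, and soup.p returns the first p tag in the document. To get every matching tag, use the find_all function. find_all returns a list-like sequence, so you can loop over it and pull out what you need from each item.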
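Looping over the result of find_all might look like the sketch below. It is written for Python 3 and uses a cut-down version of the sample document; the example.com URLs are placeholders assumed for illustration, not values from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document; the example.com URLs are placeholders.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list-like result, so it can be looped over directly.
names = []
for link in soup.find_all("a"):
    names.append(link.get_text())

print(names)
```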

get_text() returns the text content. It works on any tag obtained from a BeautifulSoup object; try print(soup.p.get_text()).

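You can also read a tag's attributes. For example, to get the value of the a tag's href attribute, use print(soup.a['href']). Other attributes such as class work the same way (soup.a['class']); note that class comes back as a list, because a tag can carry several classes.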
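A minimal sketch of attribute access, assuming Python 3 and a one-link document whose URL is a placeholder chosen for illustration. It also shows .get() and .attrs, which the text above does not mention but which are part of the same Tag interface.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document; the URL is a placeholder.
soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = soup.a

href = link["href"]          # single-valued attribute: a plain string
classes = link["class"]      # multi-valued attribute: a list of strings
missing = link.get("title")  # .get() returns None instead of raising KeyError
all_attrs = link.attrs       # every attribute of the tag, as a dict

print(href, classes, missing, all_attrs)
```

Indexing with link["title"] would raise KeyError here, so .get() is the safer choice when an attribute may be absent.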

Structural tags such as head are reached in the same way, through soup.head, as already shown above.

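How do you get a tag's children as a list? Use the contents attribute. For example, print(soup.head.contents) returns all direct children of head as a list. Index into it with [num] to get a single child, and use .name to get that child's tag name.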

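You can also get a tag's children with the children attribute, but print(soup.head.children) does not show a list; it is an iterator and prints something like <listiterator object at 0x108e6d150>. Wrap it in list() to turn it into a list, or walk through the children with a for loop.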
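The difference between contents and children can be sketched as follows, assuming Python 3 and a document reduced to just the head section.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
head = soup.head

# contents is a real list, so it supports indexing.
first_child = head.contents[0]

# children is an iterator: loop over it, or convert it with list().
child_names = [child.name for child in head.children]
as_list = list(head.children)

print(first_child.name, child_names, len(as_list))
```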

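The string attribute returns None when a tag has more than one child; otherwise it returns the tag's text. For example, print(soup.title.string) returns The Dormouse's story.

When a tag has more than one child, use strings instead, which yields every string inside the tag.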
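The behaviour of string and strings can be checked with a small sketch like this one, assuming Python 3 and a made-up one-line document. stripped_strings is a related attribute, not covered in the text above, that trims whitespace from each piece.

```python
from bs4 import BeautifulSoup

html_doc = "<p><b>Elsie</b>, <b>Lacie</b> and <b>Tillie</b></p>"
soup = BeautifulSoup(html_doc, "html.parser")

single = soup.b.string    # <b> has exactly one child, so this is its text
multiple = soup.p.string  # <p> has several children, so this is None

# strings yields every piece of text under the tag, in document order.
pieces = list(soup.p.strings)

# stripped_strings does the same but trims surrounding whitespace.
trimmed = list(soup.p.stripped_strings)

print(single, multiple, pieces, trimmed)
```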

To move up the tree, use the parent attribute; to walk through all ancestors, use parents.

To get the next sibling, use next_sibling; to get the previous sibling, use previous_sibling. To get all of them, add an s to the name (next_siblings, previous_siblings).
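Moving up and sideways through the tree might look like the sketch below. It assumes Python 3 and a compact document with no whitespace between the a tags, so that every sibling is a tag.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<html><body><p>'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.find(id="link1")
last = soup.find(id="link3")

# Upwards: the direct parent, then every ancestor up to the document itself.
parent_name = first.parent.name
ancestor_names = [ancestor.name for ancestor in first.parents]

# Sideways: one step, then all remaining siblings in each direction.
next_id = first.next_sibling["id"]
later_ids = [tag["id"] for tag in first.next_siblings]
prev_id = last.previous_sibling["id"]
earlier_ids = [tag["id"] for tag in last.previous_siblings]

print(parent_name, ancestor_names, later_ids, earlier_ids)
```

In a document formatted with line breaks between tags, next_sibling is often a whitespace text node rather than a tag, so indexing it with ["id"] would fail; checking the node's name first avoids that.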

How do you search the tree?

Use the find_all function.

find_all(name, attrs, recursive, text, limit, **kwargs)

Examples:

print(soup.find_all('title'))       # by tag name
print(soup.find_all('p', 'title'))  # by tag name and CSS class
print(soup.find_all('a'))
print(soup.find_all(id="link2"))    # by id value
print(soup.find_all(id=True))       # every tag that has an id

Output:

[<title>The Dormouse's story</title>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="" id="link1">Elsie</a>, <a class="sister" href="" id="link2">Lacie</a>, <a class="sister" href="" id="link3">Tillie</a>]
[<a class="sister" href="" id="link2">Lacie</a>]
[<a class="sister" href="" id="link1">Elsie</a>, <a class="sister" href="" id="link2">Lacie</a>, <a class="sister" href="" id="link3">Tillie</a>]
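The filters above, plus the recursive parameter from the signature, can be exercised in one place. This is a sketch for Python 3 using a shortened sample document with the href attributes left out.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_name = soup.find_all("title")                # match on tag name
by_name_and_class = soup.find_all("p", "title")  # tag name plus CSS class
by_id = soup.find_all(id="link2")               # match on an attribute value
with_any_id = soup.find_all(id=True)            # any tag that has an id

# recursive=False only looks at direct children of the object searched.
top_level_only = soup.find_all("a", recursive=False)

print(len(by_name), len(by_name_and_class), len(with_any_id), len(top_level_only))
```

The a tags sit several levels below the document root, which is why the non-recursive search finds nothing.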

Searching by CSS class or selector, straight to the examples:

print(soup.find_all("a", class_="sister"))
print(soup.select("p.title"))

Searching by attribute:
print(soup.find_all("a", attrs={"class": "sister"}))

Searching by text:
print(soup.find_all(text="Elsie"))
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))

Limiting the number of results:
print(soup.find_all("a", limit=2))

Output:

[<a class="sister" href="" id="link1">Elsie</a>, <a class="sister" href="" id="link2">Lacie</a>, <a class="sister" href="" id="link3">Tillie</a>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="" id="link1">Elsie</a>, <a class="sister" href="" id="link2">Lacie</a>, <a class="sister" href="" id="link3">Tillie</a>]
[u'Elsie']
[u'Elsie', u'Lacie', u'Tillie']
[<a class="sister" href="" id="link1">Elsie</a>, <a class="sister" href="" id="link2">Lacie</a>]
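The same searches can be written for Python 3 as in the sketch below. Two things here are assumptions beyond the original post: the selector "p.story > a#link2" is only an illustration of what select() accepts, and the text filter is spelled string=, the name used since Beautiful Soup 4.4.0 for what older releases call text=.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Two equivalent ways to filter on the class attribute.
by_class = soup.find_all("a", class_="sister")
by_attrs = soup.find_all("a", attrs={"class": "sister"})

# A CSS selector: the a tag with id link2 directly inside p.story.
by_selector = soup.select("p.story > a#link2")

# Matching text nodes; results come back in document order.
by_text = soup.find_all(string=["Tillie", "Elsie", "Lacie"])

# Stop after the first two matches.
first_two = soup.find_all("a", limit=2)

print([tag["id"] for tag in by_class], [str(s) for s in by_text])
```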

In short, these functions let you find whatever you need in a document.

---end---
