Beautiful Soup Documentation 1: Quick Start - chinaboywg - ChinaUnix Blog


Category: Python/Ruby

2013-07-06 00:27:07

Original post: Beautiful Soup Documentation 1: Quick Start. Author: oychw

*Importing the Beautiful Soup library in a program:

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything

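*A demonstration of Beautiful Soup's basic features:

#!/usr/bin/env python
# -*- coding: gbk -*-
#gtalk: ouyangchongwu#gmail.com
#python qq group: Shenzhen automated testing python 113938272

import sys

# Set the default string encoding to GBK (a Python 2-only workaround)
reload(sys)
sys.setdefaultencoding('gbk')


from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()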

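Output:

<html>
 <head>
  <title>
   Page title
  </title>
 </head>
 <body>
  <p id="firstpara" align="center">
   This is paragraph
   <b>
    one
   </b>
   .
  </p>
  <p id="secondpara" align="blah">
   This is paragraph
   <b>
    two
   </b>
   .
  </p>
 </body>
</html>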



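*Some ways to navigate the soup:

Get the name of the first tag:
>>> soup.contents[0].name
u'html'

Get the name of the first tag's first child:
>>> soup.contents[0].contents[0].name
u'head'

Get the name of the parent tag:
>>> head = soup.contents[0].contents[0]
>>> head.parent.name
u'html'

The next element in parse order; for head this is the title tag it contains:
>>> head.next
<title>Page title</title>

The name of the next sibling tag:
>>> head.nextSibling.name
u'body'

The first child of the next sibling tag:
>>> head.nextSibling.contents[0]
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

The next sibling of that first child:
>>> head.nextSibling.contents[0].nextSibling
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>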
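
For readers on Python 3, where the BeautifulSoup 3 module used in this post is not available, the same navigation can be sketched with the newer bs4 package. This goes beyond the original post and rests on assumptions: bs4 is installed, the standard-library html.parser backend is used, and the bs4 attribute names next_element and next_sibling stand in for next and nextSibling. The p tags are closed explicitly here so that the tree does not depend on how the parser repairs unclosed tags.

```python
# Sketch only: assumes the bs4 package (Beautiful Soup 4) on Python 3,
# which the original post does not use.
from bs4 import BeautifulSoup

DOC = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
       '</body></html>')

soup = BeautifulSoup(DOC, 'html.parser')

html_tag = soup.contents[0]          # first top-level tag
head = html_tag.contents[0]          # first child of <html>
title_tag = head.next_element        # next element in parse order
body = head.next_sibling             # next tag at the same level as <head>
first_p = body.contents[0]           # first child of <body>
second_p = first_p.next_sibling      # the paragraph that follows it

print(html_tag.name, head.name, head.parent.name)
print(title_tag)
print(body.name, first_p['id'], second_p['id'])
```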


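*Searching the soup for a given tag, or for tags with given attributes:


>>> titleTag = soup.html.head.title
>>> titleTag
<title>Page title</title>

>>> titleTag.string
u'Page title'

>>> len(soup('p'))
2
This shows that there are two p tags.

>>> soup.findAll('p', align="center")
[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]

>>> soup.find('p', align="center")
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>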


>>> soup('p', align="center")[0]['id']
u'firstpara'

>>> soup.find('p', align=re.compile('^b.*'))['id']
u'secondpara'

>>> soup.find('p').b.string
u'one'

>>> soup('p')[1].b.string
u'two'
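
The searches in this section can also be sketched for Python 3 with the newer bs4 package. This is an assumption beyond the original post: it presumes bs4 is installed, uses the html.parser backend, and uses the bs4 method name find_all in place of findAll. The variable names are illustrative.

```python
# Sketch only: assumes the bs4 package (Beautiful Soup 4) on Python 3,
# which the original post does not use.
import re
from bs4 import BeautifulSoup

DOC = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
       '</body></html>')

soup = BeautifulSoup(DOC, 'html.parser')

title_text = soup.html.head.title.string           # text inside <title>
paragraphs = soup.find_all('p')                    # every <p> tag
centered = soup.find_all('p', align='center')      # filter on an attribute value
first_id = centered[0]['id']                       # read an attribute
# A compiled regular expression can be used as the attribute filter.
second_id = soup.find('p', align=re.compile('^b.*'))['id']
bold_texts = [p.b.string for p in paragraphs]      # text of each <b> inside a <p>

print(title_text, len(paragraphs), first_id, second_id, bold_texts)
```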


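*Modifying the soup

Add an id to the title:
>>> titleTag['id'] = 'theTitle'

Change the title text:
>>> titleTag.contents[0].replaceWith("New title")

>>> soup.html.head
<head><title id="theTitle">New title</title></head>

Remove the first p tag:
>>> soup.p.extract()
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

>>> print soup.prettify()
<html>
 <head>
  <title id="theTitle">
   New title
  </title>
 </head>
 <body>
  <p id="secondpara" align="blah">
   This is paragraph
   <b>
    two
   </b>
   .
  </p>
 </body>
</html>

Replace the remaining p tag with the b tag it contains:
>>> soup.p.replaceWith(soup.b)
>>> print soup.prettify()
<html>
 <head>
  <title id="theTitle">
   New title
  </title>
 </head>
 <body>
  <b>
   two
  </b>
 </body>
</html>

Insert text into the body at a given position:
>>> soup.body.insert(0, "This page used to have ")
>>> soup.body.insert(2, " &lt;p&gt; tags!")

>>> soup.body
<body>This page used to have <b>two</b> &lt;p&gt; tags!</body>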
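
The same sequence of modifications can be sketched for Python 3 with the newer bs4 package. This is an assumption beyond the original post: it presumes bs4 is installed, uses the html.parser backend, and uses the bs4 method name replace_with in place of replaceWith. Unlike the session in this post, the sketch inserts a literal " <p> tags!" string and reads it back with get_text(), since bs4 escapes the angle brackets itself when the tree is printed.

```python
# Sketch only: assumes the bs4 package (Beautiful Soup 4) on Python 3,
# which the original post does not use.
from bs4 import BeautifulSoup

DOC = ('<html><head><title>Page title</title></head>'
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
       '</body></html>')

soup = BeautifulSoup(DOC, 'html.parser')
title_tag = soup.html.head.title

title_tag['id'] = 'theTitle'                  # add an attribute
title_tag.string.replace_with('New title')    # swap the title text

removed = soup.p.extract()                    # detach the first <p> from the tree
remaining_paragraphs = len(soup.find_all('p'))

soup.p.replace_with(soup.b)                   # replace the remaining <p> with its <b>

soup.body.insert(0, 'This page used to have ')
soup.body.insert(2, ' <p> tags!')             # plain text, not parsed as markup
body_text = soup.body.get_text()

print(soup.prettify())
```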



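*Example application: extracting all links from a web page:

#!/usr/bin/env python
# -*- coding: gbk -*-
#gtalk: ouyangchongwu#gmail.com
#python qq group: Shenzhen automated testing python 113938272

import sys

# Set the default string encoding to GBK (a Python 2-only workaround)
reload(sys)
sys.setdefaultencoding('gbk')

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://blog.chinaunix.net/u/21908/")
soup = BeautifulSoup(page)
# href=True skips <a> tags that have no href attribute,
# which would otherwise raise KeyError on incident['href']
for incident in soup('a', href=True):
    print incident['href']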

The same can be done with regular expressions, but with BeautifulSoup you do not have to construct the match patterns yourself.
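
To make the comparison concrete, here is a rough sketch of the regular-expression approach on an in-memory string, using only the standard-library re module. The pattern, the extract_links helper, and the sample HTML are illustrative assumptions, not part of the original post. The pattern handles only quoted href values and will miss or mishandle other valid HTML, which is the kind of detail a parser takes care of.

```python
import re

# Hypothetical pattern: an <a ...> tag with a quoted href attribute.
# [^>]*? keeps the match inside a single opening tag.
HREF_PATTERN = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def extract_links(html):
    """Return the href value of every matching <a> tag, in document order."""
    return HREF_PATTERN.findall(html)


sample = ('<p><a href="http://example.com/a">A</a> '
          '<A class="x" HREF=\'/b\'>B</A> '
          '<a name="anchor">no link</a></p>')

links = extract_links(sample)
print(links)
```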
