Python Text Processing, Preparation: Beautiful Soup (1) - kinfinger - ChinaUnix Blog


Category: Python/Ruby

2013-09-05 16:24:13

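Earlier posts introduced BeautifulSoup, Tag, name, attributes, and NavigableString. This post picks up from NavigableString and explores navigating the tree.
A NavigableString is essentially a string, but the content of a tag may itself contain further tags, so treating everything as a plain string is not enough. Let's take a look.
The example document used in this post is: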

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

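The following sections show how to traverse this document's parse tree.
1. Navigating by tag name

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)  # parse the example document defined above
bstag = soup.head       # the first <head> tag
bstitle = soup.title    # the first <title> tag anywhere in the document
bsp = soup.body.p       # the first <p> inside <body>
print bstag, type(bstag)
print bstitle, type(bstitle)
print bsp, type(bsp)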

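Result:
<head><title>The Dormouse's story</title></head> <class 'bs4.element.Tag'>
<title>The Dormouse's story</title> <class 'bs4.element.Tag'>
<p class="title"><b>The Dormouse's story</b></p> <class 'bs4.element.Tag'>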

Next, .contents and .children. Beautiful Soup keeps a tag's children in a list called .contents; .children is an iterator over the same children.
Let's look at the children of the head tag through both:

soup = BeautifulSoup(html_doc)
bshead = soup.head
bscontents = bshead.contents  # list of head's direct children
print '1'*20
print soup.head, type(soup.head)
print '2'*20
print bscontents, type(bscontents), len(bscontents)
print '3'*20
i = 0
for children in bscontents:
    i = i+1
    print i, children, type(children)

print '4'*20
bsheadchild = bshead.children  # iterator over the same children
print bsheadchild, type(bsheadchild)
i = 0
for children in bsheadchild:
    i = i+1
    print i, children, type(children)

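Result:
11111111111111111111
<head><title>The Dormouse's story</title></head> <class 'bs4.element.Tag'>
22222222222222222222
[<title>The Dormouse's story</title>] <type 'list'> 1
33333333333333333333
1 <title>The Dormouse's story</title> <class 'bs4.element.Tag'>
44444444444444444444
<listiterator object at 0x...> <type 'listiterator'>
1 <title>The Dormouse's story</title> <class 'bs4.element.Tag'>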

As you can see, .contents and .children return only the tag's children, that is, its direct descendants. To also reach the children's children, use .descendants:

soup = BeautifulSoup(html_doc)
bshead = soup.head
bsdescendants = bshead.descendants  # generator over children, grandchildren, and so on
print bshead, type(bshead)
print bsdescendants, type(bsdescendants)
i = 0
for descendants in bsdescendants:
    i = i+1
    print i, descendants, type(descendants)

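The output is:
<head><title>The Dormouse's story</title></head> <class 'bs4.element.Tag'>
<generator object descendants at 0x...> <type 'generator'>
1 <title>The Dormouse's story</title> <class 'bs4.element.Tag'>
2 The Dormouse's story <class 'bs4.element.NavigableString'>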
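The difference between the three attributes can also be checked directly. The following is a minimal sketch, not the code from this post: it assumes Python 3 and the built-in "html.parser" backend, uses a shortened copy of the head markup, and the helper name describe is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened markup (an assumption for this sketch): just the head of the example.
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

contents = head.contents              # a real list of direct children
children = list(head.children)        # iterator over the same direct children
descendants = list(head.descendants)  # children, grandchildren, ... in document order


def describe(node):
    """Return a (kind, text) pair for a parse-tree node (hypothetical helper)."""
    kind = "Tag" if isinstance(node, Tag) else "NavigableString"
    return kind, str(node)


for node in descendants:
    print(describe(node))
```

Here .contents and .children both stop at the title tag, while .descendants goes one level further and also yields the string inside it.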
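As the code above shows, .children, .contents, and .descendants all give access to descendant objects; which one to use depends on how deep you need to go and on personal preference. We can also see that a tag's children can be either Tag or NavigableString objects.
Beautiful Soup defines a default here: if a tag has exactly one child, TAG1, and TAG1 in turn has a single NavigableString child, then the tag's .string is the same as TAG1's .string.
For example: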

soup = BeautifulSoup(html_doc)
bshead = soup.head
print bshead.string
print bshead.title.string

Output:
The Dormouse's story
The Dormouse's story
This holds only when a tag contains a single child. If it contains more than one, .string is None. As the documentation puts it:

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None
For example:

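from bs4 import BeautifulSoup

# Reconstructed markup (the tags were lost from the original listing): <head>
# holds a newline string plus <title>, so it has more than one child.
html_doc = """
<html><head>
<title><child>The Dormouse's story</child></title>
</head></html>
"""
soup = BeautifulSoup(html_doc)
bshead = soup.head
print bshead.string              # None: head has more than one child
print bshead.title.string        # title's only child is <child>, so .string is inherited
print bshead.title.child.string  # <child> contains a single string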

Output:
None
The Dormouse's story
The Dormouse's story
Next, let's examine the children of body in the same way, using the original example document. The result may look puzzling at first:

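soup = BeautifulSoup(html_doc)  # html_doc is the example document from the top of the post
bsbody = soup.body
i = 0
bsbodycontents = bsbody.contents  # direct children of body, including whitespace strings
for children in bsbodycontents:
    i = i+1
    print i, children, type(children)
print '#'*80
i = 0
bstring = soup.body.strings  # generator over every string below body
for string in bstring:
    i = i+1
    print i, repr(string), type(string)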

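Result:
1 
<class 'bs4.element.NavigableString'>
2 <p class="title"><b>The Dormouse's story</b></p> <class 'bs4.element.Tag'>
3 
<class 'bs4.element.NavigableString'>
4 <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p> <class 'bs4.element.Tag'>
5 
<class 'bs4.element.NavigableString'>
6 <p class="story">...</p> <class 'bs4.element.Tag'>
7 
<class 'bs4.element.NavigableString'>
################################################################################
1 u'\n' <class 'bs4.element.NavigableString'>
2 u"The Dormouse's story" <class 'bs4.element.NavigableString'>
3 u'\n' <class 'bs4.element.NavigableString'>
4 u'Once upon a time there were three little sisters; and their names were\n' <class 'bs4.element.NavigableString'>
5 u'Elsie' <class 'bs4.element.NavigableString'>
6 u',\n' <class 'bs4.element.NavigableString'>
7 u'Lacie' <class 'bs4.element.NavigableString'>
8 u' and\n' <class 'bs4.element.NavigableString'>
9 u'Tillie' <class 'bs4.element.NavigableString'>
10 u';\nand they lived at the bottom of a well.' <class 'bs4.element.NavigableString'>
11 u'\n' <class 'bs4.element.NavigableString'>
12 u'...' <class 'bs4.element.NavigableString'>
13 u'\n' <class 'bs4.element.NavigableString'>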
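Now it makes sense: the body contains many newline characters, and each one becomes a string node of its own, which is why the result may differ from what we expected.
For this, Beautiful Soup provides .stripped_strings, which skips whitespace-only strings and strips leading and trailing whitespace from the rest:

bsbody = soup.body
i = 0
bsbodystripcontents = bsbody.stripped_strings  # generator of whitespace-stripped strings
for string in bsbodystripcontents:
    i = i+1
    print i, repr(string), type(string), string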

Output:
1 u"The Dormouse's story" <type 'unicode'> The Dormouse's story
2 u'Once upon a time there were three little sisters; and their names were' <type 'unicode'> Once upon a time there were three little sisters; and their names were
3 u'Elsie' <type 'unicode'> Elsie
4 u',' <type 'unicode'> ,
5 u'Lacie' <type 'unicode'> Lacie
6 u'and' <type 'unicode'> and
7 u'Tillie' <type 'unicode'> Tillie
8 u';\nand they lived at the bottom of a well.' <type 'unicode'> ;
and they lived at the bottom of a well.
9 u'...' <type 'unicode'> ...

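Much more elegant, isn't it?
So far we have walked the parse tree from parent tags down to child tags, but Beautiful Soup offers much more than that.
You can also go from a child tag up to its parent with .parent, or iterate over all of its ancestors (the parent, the parent's parent, and so on) with .parents.
Besides direct children and parents, you can reach sibling nodes with .next_sibling and .previous_sibling, or iterate with .next_siblings (all following siblings) and .previous_siblings (all preceding siblings).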
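The upward and sideways attributes just listed can be sketched as follows. This is an illustration under assumptions, not code from the post: it targets Python 3 with the "html.parser" backend, and the markup is a shortened copy of the example with the whitespace between the links removed so that every sibling is a tag.

```python
from bs4 import BeautifulSoup

# Shortened markup (an assumption): no text between the <a> tags,
# so each sibling of a link is another link.
html = (
    '<html><body><p class="story">'
    '<a id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
    '<a id="link3">Tillie</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find("a", id="link2")

parent_name = link2.parent.name                      # the enclosing <p>
ancestor_names = [tag.name for tag in link2.parents]  # every ancestor, innermost first
next_id = link2.next_sibling["id"]                   # the sibling right after
prev_id = link2.previous_sibling["id"]               # the sibling right before
later_ids = [tag["id"] for tag in link2.next_siblings]
earlier_ids = [tag["id"] for tag in link2.previous_siblings]

print(parent_name, ancestor_names, next_id, prev_id)
```

In the full example document the links are separated by text such as ",\n", so there .next_sibling of a link is a NavigableString rather than the next link.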
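You can also walk the parse tree in the order in which the parser built it, using .next_elements. This resembles .next_sibling but differs slightly: the next element is whatever was parsed immediately afterwards, which may be a child rather than a sibling.

soup = BeautifulSoup(html_doc)
bslink = soup.find("a", id="link3")   # the last <a> tag
for element in bslink.next_elements:  # everything parsed after the <a> start tag
    print element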

Output:
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...
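To make the difference between parse order and sibling order concrete, here is a small sketch. It assumes Python 3 and the "html.parser" backend, and uses a one-paragraph fragment instead of the full example document.

```python
from bs4 import BeautifulSoup

# One-paragraph fragment (an assumption for this sketch).
html = '<p><a id="link3">Tillie</a>; and they lived at the bottom of a well.</p>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", id="link3")

sibling = link.next_sibling   # the node after </a>: the trailing text
element = link.next_element   # the node parsed right after <a ...>: the string inside it
following = [str(e) for e in link.next_elements]

print(repr(sibling))
print(repr(element))
```

The sibling skips over the link's own contents, while the next element starts with them.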

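Next comes the functionality we use most often: searching the parse tree.
The main methods are find and find_all.
They accept the following kinds of filters: a tag name, a regular expression, a list, a function, a CSS class (class_), or text. CSS selectors are handled separately by select().
The size of the result set can be capped with limit, for example limit=2.

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
s1 = soup.find_all('a')                       # by tag name
s2 = soup.find_all(re.compile('a'), limit=2)  # tag names matching a regex, first two only
ss2 = soup.find_all(re.compile('a'))          # tag names matching a regex
s3 = soup.find_all(text='example')            # strings equal to 'example'
s4 = soup.find_all(['a', 'p'])                # any tag named in the list
print s1, type(s1)
print '#'*80
print s2, type(s2)
print '#'*80
print ss2, type(ss2)
print '#'*80
print s3, type(s3)
print '#'*80
print s4, type(s4)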

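Output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] <class 'bs4.element.ResultSet'>
################################################################################
[<head><title>The Dormouse's story</title></head>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>] <class 'bs4.element.ResultSet'>
################################################################################
[<head><title>The Dormouse's story</title></head>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] <class 'bs4.element.ResultSet'>
################################################################################
[] <class 'bs4.element.ResultSet'>
################################################################################
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>] <class 'bs4.element.ResultSet'>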
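The listing above does not exercise the function and CSS-class filters, so here is a hedged sketch of both. It assumes Python 3 and the "html.parser" backend, uses a shortened copy of the example markup, and the predicate name has_class_but_no_id is chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened markup (an assumption): two paragraphs and two links.
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Function filter: find_all calls this once per tag and keeps the True ones."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_function = [t.name for t in soup.find_all(has_class_but_no_id)]
by_class = [t["id"] for t in soup.find_all("a", class_="sister")]
first_only = [t["id"] for t in soup.find_all("a", limit=1)]
first = soup.find("a")       # a single Tag instead of a list
missing = soup.find("table")  # None when nothing matches

print(by_function, by_class, first_only)
```

find behaves like find_all with limit=1, except that it returns the tag itself, or None, rather than a list.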
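Fairly simple, isn't it? When using these methods, pay attention to whether an argument searches by tag name or by text content, to what the returned result list contains, and to how regular expressions are applied: re.compile('a') matches any tag whose name contains 'a', which is why head shows up in the results above.
Finally, one more shortcut: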

Calling a tag is like calling find_all()

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object. These two lines of code are equivalent:
soup.find_all("a")

soup("a") 
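The equivalence can be checked with a short sketch. It assumes Python 3, the "html.parser" backend, and a two-link fragment made up for this check.

```python
from bs4 import BeautifulSoup

# Two-link fragment (an assumption for this sketch).
html = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

explicit = soup.find_all("a")  # the long form
shortcut = soup("a")           # calling the soup object
nested = soup.p("a", limit=1)  # calling a Tag forwards the same arguments

print(explicit == shortcut)
```

Keyword arguments such as limit pass through the call form unchanged.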
