Category: Python/Ruby

2013-09-05 16:24:13

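Earlier posts introduced BeautifulSoup, Tag, name, attributes and NavigableString. This time we look more closely at NavigableString and at navigating the tree.
A NavigableString is essentially a Unicode string, but it also knows where it sits in the parse tree (its parent, its siblings, what comes next), so calling it just a string does not tell the whole story. Let's see how that works.
The examples in this post use the following document: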

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Now let's see how to walk the parse tree of this document.
1. Navigating with tag names

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
bstag = soup.head
bstitle = soup.title
bsp = soup.body.p
print bstag, type(bstag)
print bstitle, type(bstitle)
print bsp, type(bsp)
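Result:
<head><title>The Dormouse's story</title></head> <class 'bs4.element.Tag'>
<title>The Dormouse's story</title> <class 'bs4.element.Tag'>
<p class="title"><b>The Dormouse's story</b></p> <class 'bs4.element.Tag'>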
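
A small sketch of the same tag-name navigation, written for Python 3 rather than the Python 2 used in the listings here. The parser name "html.parser" and the lookup of a non-existent `table` tag are my own additions for illustration; the post itself does not name a parser.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
</body></html>"""

# Assumption: "html.parser" is the parser bundled with Python.
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.body.p      # attribute access returns only the FIRST <p> under <body>
missing = soup.body.table  # there is no <table>, so the lookup gives None

print(first_p.name, first_p["class"])
print(missing)
```

Attribute-style access is handy for a quick look, but it only ever reaches the first matching tag, so check for None before chaining further.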



Next come .contents and .children. BS keeps a tag's children in a list, which it exposes as .contents; .children is an iterator over the same children.
Let's look at both for the head tag:

soup = BeautifulSoup(html_doc)
bshead = soup.head
bscontents = bshead.contents
print '1'*20
print soup.head, type(soup.head)
print '2'*20
print bscontents, type(bscontents), len(bscontents)
print '3'*20
i = 0
for child in bscontents:
    i = i + 1
    print i, child, type(child)

print '4'*20
bsheadchild = bshead.children
print bsheadchild, type(bsheadchild)
i = 0
for child in bsheadchild:
    i = i + 1
    print i, child, type(child)

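Result:
11111111111111111111
<head><title>The Dormouse's story</title></head> <class 'bs4.element.Tag'>
22222222222222222222
[<title>The Dormouse's story</title>] <type 'list'> 1
33333333333333333333
1 <title>The Dormouse's story</title> <class 'bs4.element.Tag'>
44444444444444444444
<listiterator object at 0x...> <type 'listiterator'>
1 <title>The Dormouse's story</title> <class 'bs4.element.Tag'>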
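
To make the list-versus-iterator difference concrete, here is a minimal Python 3 sketch. The `<ul>` markup and the variable names are made up for the illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, small enough to count the children by eye.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

as_list = ul.contents   # a real list: len() and indexing both work
as_iter = ul.children   # an iterator over the same children, consumed once

second_text = as_list[1].string
child_names = [child.name for child in as_iter]

print(len(as_list), second_text, child_names)
```

Reach for .contents when you need a position or a count, and .children when you only want to loop.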

As you can see, .contents and .children both give you the tag's children, that is, its direct descendants only. If you also want the children of those children, use .descendants:

soup = BeautifulSoup(html_doc)
bshead = soup.head
bsdescendants = bshead.descendants
print bshead, type(bshead)
print bsdescendants, type(bsdescendants)
i = 0
for descendant in bsdescendants:
    i = i + 1
    print i, descendant, type(descendant)
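The output:
<head><title>The Dormouse's story</title></head> <class 'bs4.element.Tag'>
<generator object descendants at 0x...> <type 'generator'>
1 <title>The Dormouse's story</title> <class 'bs4.element.Tag'>
2 The Dormouse's story <class 'bs4.element.NavigableString'>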
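
The same contrast in a few lines of Python 3, as a sketch: head has one direct child but two descendants, because the text inside title counts as a node too. The parser choice "html.parser" is an assumption.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

direct_count = len(head.contents)     # only <title>
all_below = list(head.descendants)    # <title>, then the string inside it
kinds = [type(node).__name__ for node in all_below]

print(direct_count, kinds)
```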
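As the code above shows, .children, .contents and .descendants all get you the nodes below a tag; which one you use is mostly a matter of taste (keeping in mind that only .descendants goes deeper than one level). We can also see that a tag's children can be either a Tag or a NavigableString.
BS defines one convenience here: if a tag has a single child TAG1, and TAG1 in turn has a single NavigableString child, then the outer tag's .string is the same as TAG1's.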


soup = BeautifulSoup(html_doc)
bshead = soup.head

print bshead.string
print bshead.title.string
output:
The Dormouse's story
The Dormouse's story
This only holds when a tag has exactly one child. With more than one child, .string is None. In the words of the documentation:

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None
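
Here is a minimal Python 3 sketch of both rules side by side. The two `div` blocks and their ids are hypothetical markup made up for this illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div with a single chain of only-children,
# one div with two children.
html = (
    "<div id='one'><p><b>only text</b></p></div>"
    "<div id='many'><p>first</p><p>second</p></div>"
)
soup = BeautifulSoup(html, "html.parser")

one = soup.find("div", id="one")
many = soup.find("div", id="many")

single = one.string                   # passed up through <p> and <b>
ambiguous = many.string               # two children, so None
all_strings = list(many.strings)      # .strings still reaches every text node

print(single, ambiguous, all_strings)
```

When .string comes back as None, .strings or get_text() is usually what you wanted.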
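For example:

from bs4 import BeautifulSoup

# Only the <child> line survives from the original listing. The wrapping tags
# follow from the lookups below; the newline before </head> is an ASSUMED
# second child, chosen so that head.string comes out as None.
html_doc = """
<html><head><title><child>The Dormouse's story</child></title>
</head></html>
"""

soup = BeautifulSoup(html_doc)
bshead = soup.head

print bshead.string              # None: head holds more than one child
print bshead.title.string        # taken from title's only child, <child>
print bshead.title.child.string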
output:

None
The Dormouse's story
The Dormouse's story
Next let's study the children of body in the same way. The first time you see the result you may well be scratching your head:

soup = BeautifulSoup(html_doc)
bsbody = soup.body
i = 0
bsbodycontents = bsbody.contents
for child in bsbodycontents:
    i = i + 1
    print i, child, type(child)
print '#'*80
i = 0
bstring = soup.body.strings
for string in bstring:
    i = i + 1
    print i, repr(string), type(string)
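Result:
1 
<class 'bs4.element.NavigableString'>
2 <p class="title"><b>The Dormouse's story</b></p> <class 'bs4.element.Tag'>
3 
<class 'bs4.element.NavigableString'>
4 <p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p> <class 'bs4.element.Tag'>
5 
<class 'bs4.element.NavigableString'>
6 <p class="story">...</p> <class 'bs4.element.Tag'>
7 
<class 'bs4.element.NavigableString'>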
################################################################################
1 u'\n'
2 u"The Dormouse's story"
3 u'\n'
4 u'Once upon a time there were three little sisters; and their names were\n'
5 u'Elsie'
6 u',\n'
7 u'Lacie'
8 u' and\n'
9 u'Tillie'
10 u';\nand they lived at the bottom of a well.'
11 u'\n'
12 u'...'
13 u'\n'
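Now it makes sense: the body contains a lot of newlines, and each one is a NavigableString child in its own right, which is why the result differs from what we expected.
For this BS provides .stripped_strings, which strips each string and skips the ones that are only whitespace: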

bsbody = soup.body
i = 0
bsbodystripcontents = bsbody.stripped_strings
for string in bsbodystripcontents:
    i = i + 1
    print i, repr(string), type(string), string
output:
1 u"The Dormouse's story" <type 'unicode'> The Dormouse's story
2 u'Once upon a time there were three little sisters; and their names were' <type 'unicode'> Once upon a time there were three little sisters; and their names were
3 u'Elsie' <type 'unicode'> Elsie
4 u',' <type 'unicode'> ,
5 u'Lacie' <type 'unicode'> Lacie
6 u'and' <type 'unicode'> and
7 u'Tillie' <type 'unicode'> Tillie
8 u';\nand they lived at the bottom of a well.' <type 'unicode'> ;
and they lived at the bottom of a well.
9 u'...' <type 'unicode'> ...
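
A compact Python 3 sketch of .strings next to .stripped_strings. The markup is a made-up fragment with deliberate line breaks and indentation, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with newlines and indentation between the tags.
html = "<p>\n  <b>Elsie</b>,\n  <b>Lacie</b>\n</p>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.p.strings)             # whitespace-only strings are included
clean = list(soup.p.stripped_strings)  # stripped, and empty ones dropped

print(raw)
print(clean)
```

Only the ends of each string are stripped, which is why entry 8 in the output above still carries a newline in the middle.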

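Much more elegant, isn't it? Heh.
So far we have walked the parse tree from a parent tag down to its child tags, but BS offers a lot more than that.
You can also go from a child tag up to its parent with .parent, or through every ancestor with .parents (the parent, the parent's parent, and so on).
Besides direct children and parents you can reach sibling nodes with .next_sibling and .previous_sibling, and all following or preceding siblings with .next_siblings and .previous_siblings.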
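
A short Python 3 sketch of moving up and sideways. The one-line markup is a trimmed, hypothetical version of the story paragraph, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed-down, hypothetical version of the "story" paragraph.
html = '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find("a", id="link2")

parent_name = link2.parent.name                        # the enclosing <p>
ancestor_names = [tag.name for tag in link2.parents]   # <p>, then the soup object itself
prev_sib = link2.previous_sibling                      # a string, not the previous <a>
next_sib = link2.next_sibling                          # also a string
later_links = [s["id"] for s in link2.next_siblings if isinstance(s, Tag)]

print(parent_name, ancestor_names, repr(prev_sib), repr(next_sib), later_links)
```

Note that the immediate siblings of a link are the text fragments between the links, so filter by type when you only want tags.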
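Of course, you can also walk the tree in the order the parser built it, using .next_elements. This resembles .next_sibling but is subtly different, because it steps into a tag's contents before moving on to whatever follows the tag.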

soup = BeautifulSoup(html_doc)
link3 = soup.find("a", id="link3")   # the Tillie link
for element in link3.next_elements:
    print element

Output:
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...
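
To pin down the difference between sibling order and parse order, here is a small Python 3 sketch. The two-paragraph markup is a hypothetical cut-down of the document, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down of the story: one link, some trailing text, a second paragraph.
html = "<p><a>Tillie</a>; and they lived</p><p>...</p>"
soup = BeautifulSoup(html, "html.parser")
link = soup.a

sibling = link.next_sibling    # the text that follows </a> at the same level
element = link.next_element    # the text INSIDE <a>, which was parsed right after <a>
in_parse_order = [str(el) for el in link.next_elements]

print(repr(sibling), repr(element))
print(in_parse_order)
```

.next_elements keeps going past the end of the enclosing paragraph, which is why the second `<p>` and its text show up in the list.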


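Next, the methods we use most often: searching the parse tree.
The main methods are find and find_all.
They accept several kinds of filter: a tag name, a regular expression, a list, a function, a CSS class (class_) and text. Full CSS selectors go through the separate select() method.
The size of the result set can be capped with limit, for example limit=2.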


import re

soup = BeautifulSoup(html_doc)
s1 = soup.find_all('a')                        # by tag name
s2 = soup.find_all(re.compile('a'), limit=2)   # by regex on the tag name, first two hits
ss2 = soup.find_all(re.compile('a'))
s3 = soup.find_all(text='example')             # by exact text content
s4 = soup.find_all(['a', 'p'])                 # any tag named in the list
print s1, type(s1)
print '#'*80
print s2, type(s2)
print '#'*80
print ss2, type(ss2)
print '#'*80
print s3, type(s3)
print '#'*80
print s4, type(s4)
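Output:

[<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>] <class 'bs4.element.ResultSet'>
################################################################################
[<head><title>The Dormouse's story</title></head>, <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>] <class 'bs4.element.ResultSet'>
################################################################################
[<head><title>The Dormouse's story</title></head>, <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>] <class 'bs4.element.ResultSet'>
################################################################################
[] <class 'bs4.element.ResultSet'>
################################################################################
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>, <p class="story">...</p>] <class 'bs4.element.ResultSet'>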
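
The filters not exercised above (function, CSS class, anchored regex) can be sketched like this in Python 3. This assumes a Beautiful Soup release that accepts the `string=` keyword (older ones spell it `text=`, as in the listing above) and the "html.parser" parser; the helper `names` and the lambda are my own.

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")


def names(tags):
    """Hypothetical helper: reduce a result set to its tag names."""
    return [tag.name for tag in tags]


# A regex is SEARCHED within the tag name, so "a" also hits <head>.
loose = names(soup.find_all(re.compile("a")))
# Anchoring the pattern restricts it to tags named exactly "a".
strict = names(soup.find_all(re.compile("^a$")))
# CSS class filter plus limit.
first_two = [a["id"] for a in soup.find_all("a", class_="sister", limit=2)]
# A function filter: tags that have a class but no id.
classed = names(soup.find_all(lambda t: t.has_attr("class") and not t.has_attr("id")))
# Text filters match whole strings unless you pass a regex.
exact = soup.find_all(string="example")
partial = soup.find_all(string=re.compile("Dormouse"))

print(loose, strict, first_two, classed, exact, len(partial))
```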
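Feels fairly simple, doesn't it? Heh. When using these methods, pay attention to whether the argument is matched against tag names or against text content, to what kind of result list comes back, and to how a regular expression is applied: re.compile('a') matched head as well as a, because the pattern is searched anywhere within the tag name.
Finally, one shortcut from the documentation: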

Calling a tag is like calling find_all()

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object. These two lines of code are equivalent:
soup.find_all("a")

soup("a") 
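
A quick Python 3 check of that equivalence, as a sketch; the two-link markup is hypothetical and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link fragment.
soup = BeautifulSoup('<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>', "html.parser")

long_form = soup.find_all("a")
short_form = soup("a")               # calling the object is find_all()
first_only = soup.p("a", limit=1)    # works on a Tag too, with the same keyword arguments

print([a["id"] for a in short_form], [a["id"] for a in first_only])
```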
