First Steps with Python BeautifulSoup - UMK_eRain - ChinaUnix Blog


First Steps with Python BeautifulSoup

Category: Python/Ruby

2017-02-17 00:17:43

1. Install BeautifulSoup
pip install beautifulsoup4
pip install lxml (this may fail; if it does, download the whl package directly from the website and install that instead: ~gohlke/pythonlibs/#lxml)
pip install html5lib
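
Since the lxml install can fail, it may help to check which parsers BeautifulSoup can actually load before running a scraper. The snippet below is only a minimal sketch written for Python 3; the helper name available_parsers is made up for illustration and is not part of bs4. It relies on bs4 raising FeatureNotFound when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def available_parsers(candidates=("lxml", "html5lib", "html.parser")):
    """Return the parser names that BeautifulSoup can load on this machine.

    available_parsers is a hypothetical helper, not a bs4 function.
    """
    usable = []
    for name in candidates:
        try:
            # Parsing a tiny document is enough to trigger the parser lookup.
            BeautifulSoup("<p>ok</p>", name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is not installed.
            continue
        usable.append(name)
    return usable


if __name__ == "__main__":
    print(available_parsers())
```

"html.parser" ships with Python itself, so it should always be in the list; if "lxml" is missing, the whl route mentioned above is the fallback.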


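#coding:utf-8
# Python 2 script (uses urllib2 and the print statement).
from bs4 import BeautifulSoup
import urllib2
import re

HomePage = ""


def getMenuList():
    '''
    Get the addresses in the site's menu list.
    '''
    menulist = {}
    webdata = urllib2.urlopen(HomePage).read()
    soup = BeautifulSoup(webdata, 'lxml')
    # Every element whose id attribute is "menu"
    menu = soup.find_all(id='menu')
    # print menu
    for m in menu:
        url = m.find_all('a')
        for u in url:
            href = u.get('href')
            title = u.get_text()
            # An <a> without href returns None and would crash re.match
            if href is None:
                continue
            # Relative links get the home page address prepended
            if not re.match(r'https?:', href):
                href = HomePage + href
            print title, href
            menulist[title] = href
    return menulist


if __name__ == "__main__":
    menu = getMenuList()
    # print menu
    # Look up one menu entry by its title as it appears on the site
    url = menu.get(u'福利片')
    print url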

Output:
首页
电影 m/1.html
电视剧 /m/9.html
综艺片 /m/15.html
福利片 /m/16.html
伦理片
图片 /new/n/19.html
小说

/m/16.html

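By reading the page source and filtering on an identifying attribute (here, the id of the menu element), you can pull out exactly the data you want. A few lines of code are enough, which is very convenient. More to learn from here.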
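
For anyone on Python 3, the same idea can be tried without touching the network. This is just a sketch under assumptions: BASE_URL and SAMPLE_HTML are invented stand-ins for the real site, get_menu_list is an illustrative name, and urljoin from the standard library replaces the manual string concatenation used in the script above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base address and markup, standing in for the real site.
BASE_URL = "http://example.com/"
SAMPLE_HTML = """
<div id="menu">
  <a href="/m/1.html">Movies</a>
  <a href="http://example.com/m/9.html">TV</a>
  <a>No link</a>
</div>
"""


def get_menu_list(html, base_url):
    """Map the text of each menu link to its absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    menu = {}
    # Filter on the identifying attribute first, then collect the links inside.
    for block in soup.find_all(id="menu"):
        # href=True skips <a> tags that have no href attribute at all.
        for a in block.find_all("a", href=True):
            # urljoin leaves absolute URLs alone and resolves relative ones.
            menu[a.get_text(strip=True)] = urljoin(base_url, a["href"])
    return menu


if __name__ == "__main__":
    for title, href in get_menu_list(SAMPLE_HTML, BASE_URL).items():
        print(title, href)
```

Using urljoin avoids having to test for an "http:" prefix by hand, and it also copes with https links and with relative paths that lack a leading slash.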
