mechanize-yaoshiyan-ChinaUnix博客

Yaoyao.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

yaoshiyan

博客访问： 557791
博文数量： 142
博客积分： 2966
博客等级：少校
技术积分： 1477
用户组：普通用户
注册时间： 2009-12-07 22:37

文章分类

全部博文（142）

web（0）
互联网趣事（2）
【NoSQL】（5）
【Windows Mobile（1）
【生活娱乐】（3）
【perl】（1）
【nagios】（2）
【apache】（0）
【php】（12）
【python】（65）
【asterisk】（2）
【mysql】（9）
【linux】（18）
【shell】（21）
未分配的博文（1）

文章存档

2013年（3）

2012年（21）

2011年（53）

2010年（33）

2009年（32）

我的朋友

相关博文

mechanize

分类： Python/Ruby

2012-10-10 18:52:03

Stateful programmatic web browsing in Python, after Andy Lester’s Perl module.

mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
- any URL can be opened, not just http:
- mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by callingbuild_opener().
Easy HTML form filling.
Convenient link parsing and following.
Browser history (.back() and .reload() methods).
The Referer HTTP header is added properly (optional).
Automatic observance of .
Automatic handling of HTTP-Equiv and Refresh.

Examples

The examples below are written for a website that does not exist (example.com), so cannot be run. There are also some working examples that you can run.

import re
import mechanize

br = mechanize.Browser()
br.open("")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info() # headers
print response1.read() # body

br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm.
br["cheeses"] = ["mozzarella", "caerphilly"] # (the method here is __setitem__)
# Submit current form. Browser calls .close() on the current response on
# navigation, so this closes response1
response2 = br.submit()

# print currently selected form (don't call .submit() on this, use br.submit())
print br.form

response3 = br.back() # back to cheese shop (same data as response1)
# the history mechanism returns cached response objects
# we can still use the response, even though it was .close()d
response3.get_data() # like .seek(0) followed by .read()
response4 = br.reload() # fetches from server

for form in br.forms():
print form
# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in br.links(url_regex="python.org"):
print link
br.follow_link(link) # takes EITHER Link instance OR keyword args
br.back()

You may control the browser’s policy by using the methods of mechanize.Browser’s base class, mechanize.UserAgent. For example:

br = mechanize.Browser()
# Explicitly configure proxies (Browser will attempt to set good defaults).
# Note the userinfo ("joe:password@") and port number (":3128") are optional.
br.set_proxies({"http": "joe:password@myproxy.example.com:3128",
"ftp": "proxy.example.com",
})
# Add HTTP Basic/Digest auth username and password for HTTP proxy access.
# (equivalent to using "joe:password@..." form above)
br.add_proxy_password("joe", "password")
# Add HTTP Basic/Digest auth username and password for website access.
br.add_password("", "joe", "password")
# Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
br.set_handle_equiv(False)
# Ignore robots.txt. Do not do this without thought and consideration.
br.set_handle_robots(False)
# Don't add Referer (sic) header
br.set_handle_referer(False)
# Don't handle Refresh redirections
br.set_handle_refresh(False)
# Don't handle cookies
br.set_cookiejar()
# Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
# default: no need to do this unless you have some reason to use a
# particular cookiejar)
br.set_cookiejar(cj)
# Log information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)
# Log HTTP response bodies (ie. the HTML, most of the time).
br.set_debug_responses(True)
# Print HTTP headers.
br.set_debug_http(True)

# To make sure you're seeing all debug output:
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

# Sometimes it's useful to process bad headers or bad HTML:
response = br.response() # this is a copy of response
headers = response.info() # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("

阅读(970) | 评论(0) | 转发(0) |

上一篇：php header()使用说明

下一篇：ajax跨域

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6