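I've been working on a web crawler lately and heard that WebKit is the ultimate tool for crawling, so I started looking into pywebkit.
There turned out to be very few resources on it in Chinese, and even after a long search I found nothing particularly good in English either.
I couldn't find any documentation on the official site, and the examples given there didn't cover what I needed.
After piecing things together from various places, I found a class that returns a page's HTML: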

class WebView(webkit.WebView):
    def get_html(self):
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html

I then created a few WebView objects, opened a URL, and called get_html(), but got nothing back:

web = WebView()
web.open(url)
html = web.get_html()

print html
print str(html)

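The empty result is most likely because page loading is asynchronous: open() only starts the load, and nothing is fetched or rendered until the GTK main loop runs. The sketch below is a toy model in plain Python, not pywebkit code. ToyView, run_main_loop and the fake HTML string are hypothetical names invented for illustration; they only mimic the order of events.

```python
class ToyView(object):
    """Toy stand-in for a browser view. NOT the pywebkit API."""

    def __init__(self):
        self._html = ''        # nothing has been loaded yet
        self._pending = None   # URL waiting to be loaded
        self._callbacks = []   # listeners for the "load finished" event

    def open(self, url):
        # Only schedules the load and returns immediately.
        self._pending = url

    def connect(self, callback):
        self._callbacks.append(callback)

    def get_html(self):
        return self._html

    def run_main_loop(self):
        # Stand-in for the GUI main loop: performs the pending load,
        # then notifies every registered listener.
        if self._pending is not None:
            self._html = '<html><body>content of %s</body></html>' % self._pending
            self._pending = None
            for callback in self._callbacks:
                callback(self)


view = ToyView()
view.open('http://example.com/')
html_too_early = view.get_html()   # asked before the loop ran: still empty

results = []
view.connect(lambda v: results.append(v.get_html()))
view.run_main_loop()               # the load happens here, then the callback fires
```

In this model html_too_early stays empty, while the callback sees the loaded page. That is the same pattern as reading the HTML from a 'load-finished' handler once the main loop is running.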
After some more Googling, I finally found out how to use this class properly:

#!/usr/bin/env python
# Python 2 script; requires pygtk and pywebkitgtk.

import sys, thread # kudos to Nicholas Herriot (see comments)
import gtk
import webkit
import warnings
from time import sleep
from optparse import OptionParser

warnings.filterwarnings('ignore')


class WebView(webkit.WebView):
    def get_html(self):
        # Copy the page's HTML into document.title, read the title back,
        # then restore the original title.
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html


class Crawler(gtk.Window):
    def __init__(self, url, file):
        gtk.gdk.threads_init() # suggested by Nicholas Herriot for Ubuntu Koala
        gtk.Window.__init__(self)
        self._url = url
        self._file = file

    def crawl(self):
        view = WebView()
        # Register the handler before starting the load.
        view.connect('load-finished', self._finished_loading)
        view.open(self._url)
        self.add(view)
        # The page is only loaded while the GTK main loop is running.
        gtk.main()

    def _finished_loading(self, view, frame):
        # Called once the page has loaded: save the HTML and stop the main loop.
        with open(self._file, 'w') as f:
            f.write(view.get_html())
        gtk.main_quit()


def main():
    options = get_cmd_options()
    crawler = Crawler(options.url, options.file)
    crawler.crawl()


def get_cmd_options():
    """
    gets and validates the input from the command line
    """
    usage = "usage: %prog [options] args"
    parser = OptionParser(usage)
    parser.add_option('-u', '--url', dest = 'url', help = 'URL to fetch data from')
    parser.add_option('-f', '--file', dest = 'file', help = 'Local file path to save data to')

    (options,args) = parser.parse_args()

    if not options.url:
        print 'You must specify an URL.',sys.argv[0],'--help for more details'
        sys.exit(1)
    if not options.file:
        print 'You must specify a destination file.',sys.argv[0],'--help for more details'
        sys.exit(1)

    return options


if __name__ == '__main__':
    main()

OK, the HTML is in hand.
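As a side note, the manual checks in get_cmd_options() could be handed over to argparse, which has been in the standard library since Python 2.7 and can enforce required options itself. This is only a sketch of that alternative, not part of the script I found. build_parser, the description text and the example arguments are my own assumptions.

```python
import argparse


def build_parser():
    # Same two options as the original script, but marked as required
    # so argparse prints the usage message and exits when one is missing.
    parser = argparse.ArgumentParser(
        description='Save the rendered HTML of a page to a file')
    parser.add_argument('-u', '--url', required=True,
                        help='URL to fetch data from')
    parser.add_argument('-f', '--file', required=True,
                        help='Local file path to save data to')
    return parser


def get_cmd_options(argv=None):
    # With argv=None, argparse reads the options from sys.argv[1:].
    return build_parser().parse_args(argv)


# Example call with an explicit argument list (hypothetical values).
options = get_cmd_options(['-u', 'http://example.com/', '-f', 'page.html'])
```

The returned object still exposes options.url and options.file, so the rest of the script would not need to change.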