Category: Python/Ruby

2010-09-21 00:44:08

A web-scraping tool written in Python that automatically downloads all articles from a Baidu Space
2010-06-12 14:12
Introduction:
A Python program I wrote that automatically downloads all of the articles from a specified Baidu Space.


Usage is simple: after installing Python, save the script as down.py, then run the following from the folder that contains the file:

python down.py <url>

where <url> is the address of the Baidu Space to be downloaded.
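
For example, with a hypothetical space named example_user, the command would be:

python down.py http://hi.baidu.com/example_user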

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# *********************************************************************************
# Copyright (C) 2010 yangyingchao@gmail.com

# Author: yangyingchao

# This program is free software; you can redistribute it and/or modify it under
# the terms of the GNU General Public License as published by the Free Software
# Foundation; either version 2, or (at your option) any later version.

# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
# details.

# You should have received a copy of the GNU General Public License along with
# this program; see the file COPYING. If not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
# *********************************************************************************

import urllib2
import re
import sys

# Filled in from the command-line argument in __main__ below.
pattern = ""
reg_tail = ""
username = ""

def downURL(url, filename):
    print "Download %s, save as %s" % (url, filename)
    try:
        fp = urllib2.urlopen(url)
    except:
        print "download exception"
        return 0
    op = open(filename, "wb")
    while 1:
        s = fp.read()
        if not s:
            break
        op.write(s)
    fp.close()
    op.close()
    return 1

def getURL(url):
    print "Parsing %s" % url
    try:
        fp = urllib2.urlopen(url)
        contents = fp.readlines()
    except:
        print "exception"
        return []

    item_list = []
    for s in contents:
        urls = pattern.findall(s)
        if urls:
            item_list.extend(urls)
    fp.close()
    return item_list

def reptile(base_url):
    """
    Download all articles from base_url.
    Arguments:
    - `base_url`: URL of the Baidu Space.
    """
    page_list = []
    base_page = base_url.rstrip("/") + "/blog/index/"
    sign_tail = u"尾页"        # link text of the "last page" button
    tail = ""
    total_page = 10
    global username
    print username

    try:
        fp = urllib2.urlopen(base_page + "0")
    except:
        print "%s: No such url" % (base_page + "0")
        print sys.exc_info()
    else:
        for s in fp.readlines():
            if sign_tail in s.decode("gbk"):
                tail = s.decode("gbk")
                break
        fp.close()

    # The line containing the "last page" link also carries the total page
    # count as ".../N"; extract N, otherwise fall back to the default above.
    if tail:
        pos = tail.rfind(u"尾页")
        total_page = int(tail[:pos-3].split("/")[-1])

    output_list = []
    for idx in range(total_page + 1):
        item_page = "%s%d" % (base_page, idx)
        item_list = getURL(item_page)
        if item_list:
            output_list.extend(item_list)

    # Turn each relative article link into an absolute URL and download it.
    item_list = list(set(output_list))
    for item in item_list:
        down_url = item.replace("/%s" % username,
                                "http://hi.baidu.com/%s" % username)
        local_file = down_url.split("/")[-1]
        ret = downURL(down_url, local_file)
    print "Total: %d articles." % (len(item_list))

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Usage: %s <url of baidu space>" % sys.argv[0]
        print "Such as: %s http://hi.baidu.com/Username" % sys.argv[0]
        sys.exit(1)
    base_url = sys.argv[1]
    if not base_url.startswith("http://hi.baidu.com/"):
        print "Wrong type of URL?", "It works on Baidu Space only."
        sys.exit(1)

    username = base_url.rstrip("/").split("/")[-1]
    print username

    reg_tail = re.compile(u"%s.*?尾页" % username)   # currently unused below
    pattern = re.compile("/%s/blog/item/.*?\.html" % username)

    reptile(base_url)
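
To make it easier to see what getURL extracts from each index page, here is a minimal standalone check of the same link regex. The username and article id are made up for the example:

import re

username = "example_user"   # hypothetical space name
pattern = re.compile("/%s/blog/item/.*?\.html" % username)

sample = '<a href="/example_user/blog/item/abc123.html">Some article</a>'
print(pattern.findall(sample))   # -> ['/example_user/blog/item/abc123.html']

reptile() then rewrites each matched relative link into an absolute hi.baidu.com URL and hands it to downURL().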


Author:yangyingchao, 2010-06-12