Scraping the Baidu Top 500 Songs with Python 2.7.2 (Larpenteur, ChinaUnix blog)


2012-12-22 10:38:43

Original post: Scraping the Baidu Top 500 Songs with Python 2.7.2. Author: ztguang

http://blog.b999.net/post/141/

#-*- coding: UTF-8 -*-
'''
Created on 2012-3-8

@author: tiantian

Modified: 2012-4-15
Fixed saving files correctly on Windows
'''
import urllib
import re
import platform
import os

top500 = ''
#top500 = ''

songs = []

# Create the output directory on first run
if not os.path.exists('songs'):
    os.mkdir('songs')

def main():

    # NOTE: the HTML tags that anchored this pattern are missing from this listing
    divr = '.*?.*?'
    mf = urllib.urlopen(top500)
    content = mf.read()
    content = content.decode('gbk')

    content = re.sub('\n+',' ',content)
    alldiv = re.findall(divr,content)
    i =0
    for div in alldiv:
        ulr = ''
        allul = re.findall(ulr,div)

        for ul in allul:
            lir = ''
            allli = re.findall(lir,ul)

            for li in allli:
                # Skip the first 245 entries; only the later ones are downloaded
                if i<245:
                    i = i+1
                    continue
                i = i+1
                # NOTE: the HTML tags around the capture groups are missing from this listing
                songName = '.*?(.*?).*?'
                name = re.findall(songName,li)
                songAuthor = '.*?(.*?).*?'
                author = re.findall(songAuthor,li)

                songs.append([name[0],author[0]])

                songUrl = getSongUrl(name[0],author[0])

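                sysstr = platform.system()
                if sysstr == "Windows":
                    # On Windows the path is encoded as GBK before saving
                    filename = ('songs/'+name[0]+'-'+author[0]+'.mp3').encode('gbk')
                elif sysstr == "Linux":
                    filename = 'songs/'+name[0]+'-'+author[0]+'.mp3'
                else:
                    # No filename is built for other platforms, so skip this song
                    print ("Other System tasks")
                    continue
                print filename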

                try:
                    urllib.urlretrieve(songUrl,filename)
                    # The absence of an exception does not prove the download succeeded; a further check is needed
                    print i,name[0],author[0],'downloaded'

                except Exception :
                    print i,name[0],author[0],'download failed'

def getSongUrl(songName,authorName):
    '''The song name or artist name may be incomplete, in which case no URL can be obtained.'''
    songUrl = '%s$$%s$$$$&url=&listenreelect=0&.r=0.1696378872729838' % (urllib.quote(songName.encode('gbk')),urllib.quote(authorName.encode('gbk')))
    f = urllib.urlopen(songUrl)
    c = f.read()
    url1 = re.findall('.*?CDATA\[(.*?)\]].*?',c)
    url2 = re.findall('.*?CDATA\[(.*?)\]].*?',c)
    if len(url1) <1:
        return ''

    try:
        return url1[0][:url1[0].rindex('/')+1] + url2[0]
    except Exception:
        return url1[0]

if __name__ == '__main__':
    main()
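
getSongUrl percent-encodes the GBK bytes of the song and artist names before placing them in the query string. The following is a minimal Python 3 sketch of that one step, assuming `urllib.parse.quote` with its `encoding` argument stands in for the Python 2 call `urllib.quote(s.encode('gbk'))`. The helper name `gbk_quote` is hypothetical and is not part of the original script.

```python
from urllib.parse import quote, unquote


def gbk_quote(text):
    """Percent-encode text using its GBK byte representation (hypothetical helper)."""
    # encoding='gbk' makes quote() convert the str to GBK bytes before escaping.
    # errors='strict' raises UnicodeEncodeError for characters GBK cannot represent.
    return quote(text, encoding='gbk', errors='strict')


if __name__ == '__main__':
    encoded = gbk_quote('中文')
    print(encoded)
    # Decoding with the same encoding recovers the original text.
    print(unquote(encoded, encoding='gbk'))
```

Passing the encoding to `quote` keeps the name as text until the last moment, which avoids mixing byte strings and text strings when the URL is assembled.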

The downloaded MP3 files are saved in the newly created songs directory.
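
The listing itself notes that a download finishing without an exception does not prove it succeeded. One possible extra check, sketched here in Python 3, is to inspect the saved file: it should exist, be non-empty, and begin either with an ID3v2 tag or with an MPEG audio frame header. The function name `looks_like_mp3` is hypothetical, and the test is only a heuristic, not a full validation of the audio data.

```python
import os


def looks_like_mp3(path):
    """Heuristic check that a saved file is plausibly MP3 audio (hypothetical helper)."""
    # A missing or empty file is a failed download.
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, 'rb') as f:
        head = f.read(3)
    # An ID3v2 tag starts with the ASCII bytes "ID3".
    if head == b'ID3':
        return True
    # Otherwise expect an MPEG audio frame header: the first 11 bits are all set.
    return len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0
```

Such a check could run right after the download call, removing any file that fails it. An HTML error page saved under an .mp3 name, for example, would be rejected.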
