2008-05-23 15:26:57

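# mp3parser.py (Python 2 only: sgmllib was removed in Python 3)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    """Collect the href value of every <a> tag fed to the parser."""
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []
    def handle_starttag(self, tag, method, attributes):
        # sgmllib calls this only for tags that have a start_xxx or do_xxx handler
        href = [v for k, v in attributes if k == 'href']
        if href:
            self.urls.extend(href)
    def start_a(self, attributes):
        # Defining start_a is what makes handle_starttag fire for <a> tags
        pass
    # Uncomment the handlers below to also collect href attributes from these tags
    #def start_td(self,attrs):
    #    pass
    #def start_div(self,attrs):
    #    pass
    #def do_table(self,attrs):
    #    pass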
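The listing above depends on sgmllib, which exists only in Python 2. As a rough sketch of the same idea on Python 3, the standard-library html.parser.HTMLParser can play a similar role. The class name LinkLister and the sample HTML are made up for this illustration and are not part of the original post.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Hypothetical Python 3 counterpart of URLLister: collects <a href> values."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser calls this for every start tag, so filter on <a> here.
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; skip missing or empty hrefs.
        self.urls.extend(v for k, v in attrs if k == "href" and v)


parser = LinkLister()
parser.feed('<p><a href="a.mp3">one</a> <a name="x">two</a> <a href="/b.html">three</a></p>')
parser.close()
print(parser.urls)
```

Unlike the sgmllib version, no empty start_a method is needed here, because HTMLParser reports every start tag to handle_starttag and the tag check is done inside it.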
 
 
# Driver script (Python 2): fetch a page and list the links found by mp3parser.URLLister
import urllib
import mp3parser

class HtmlParser:
    def readHtml(self, url):
        # Download the page and return its raw HTML
        sock = urllib.urlopen(url)
        htmlSource = sock.read()
        sock.close()
        return htmlSource
    def parserHtml(self, html):
        # Feed the HTML to URLLister and return the collected href values
        parser = mp3parser.URLLister()
        parser.feed(html)
        parser.close()
        return parser.urls

if __name__ == "__main__":
    url = ""  # set this to the address of the page to scan
    hp = HtmlParser()
    html = hp.readHtml(url)
    #print html
    urls = hp.parserHtml(html)
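The hrefs returned by parserHtml are whatever the page author wrote, so many of them are likely to be relative. One possible follow-up step, sketched here for Python 3, resolves them against the page URL and optionally keeps only links whose path ends with a given suffix. The function name absolute_links, the .mp3 filter (guessed from the module name mp3parser) and the example URLs are assumptions for illustration, not something the original post shows.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs, suffix=None):
    """Resolve hrefs against base_url; optionally keep only paths ending in suffix."""
    result = []
    for href in hrefs:
        full = urljoin(base_url, href)
        # Compare against the path only, so a query string does not hide the extension.
        if suffix is None or urlparse(full).path.lower().endswith(suffix):
            result.append(full)
    return result


links = absolute_links(
    "http://example.com/music/index.html",
    ["a.mp3", "/b.html", "http://other.example/c.MP3?x=1"],
    suffix=".mp3",
)
print(links)
```

Called without a suffix, the function simply returns every link in absolute form, which is usually what a later download step needs.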