Python Web Crawler: Scraping Train Timetables (3) - Parsing Train Data - alertx - ChinaUnix Blog

Category: Python/Ruby

2010-09-21 00:52:55

2010-05-30 18:16

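With the links to each train's detail page in hand, we can fetch the detail pages themselves. Each page has two parts: the train's own data and the data for the stations along its route.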

First, parse the train data:

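    # Process the timetable page of each train in turn
    for oneTrain in trains:
        # Some station names use rare characters, and parsing the page as gb2312 fails on them
        # (it looks like a gb2312 support issue), so the declared charset is swapped for
        # gb18030, which is a superset of gb2312.
        pageContent=urllib2.urlopen(oneTrain).read().replace("gb2312","gb18030")
        # Train information
        trainInfo = pq(pageContent)('body center div.ResultContent div.ResultContentLeft div.ResultContentLeftContent div.ResultTrainCodeContent table').eq(1)
        trainInfos = trainInfo.map(getTrainInfo) # getTrainInfo extracts the train data
        for oneTrainInfo in trainInfos: # write the train data to the database
            insertTrainInfo(oneTrainInfo,c)
            conn.commit()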
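
The effect of the charset swap can be checked in isolation. What follows is a minimal sketch, not code from the crawler: it uses the character 镕 (U+9555) purely as an illustrative example of a character that falls outside GB2312, and `try_decode` is a helper name made up for this example.

```python
# -*- coding: utf-8 -*-
# Illustrative only: U+9555 is an example character, not one taken from the
# timetable site, and try_decode is a hypothetical helper name.
RARE_CHAR = u'\u9555'

# Bytes as a GB18030-encoded page fragment might deliver them.
raw = (u'<td>' + RARE_CHAR + u'</td>').encode('gb18030')


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for that codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_gb2312 = try_decode(raw, 'gb2312')    # None: the character is outside GB2312
as_gb18030 = try_decode(raw, 'gb18030')  # decodes: GB18030 is a superset
```

Since the same bytes decode cleanly as gb18030, rewriting the declared charset before parsing is a cheap way to get the wider codec without touching the page body.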

Extracting the train data (getTrainInfo):

from datetime import datetime
from pyquery import PyQuery as pq  # assumed import: the pq alias is used throughout, but its import is not shown

def getTrainInfo(i,e):
    # Returns one list per train, because a single page may describe several trains.
    result=[]
    # Train numbers (one result entry per train)
    trainNumbers=pq(e)('tr td').eq(2).text().split(' ')
    for oneTNum in trainNumbers:
        result.append([oneTNum])
    # Running time, converted to minutes
    runtimeMeta=pq(e)('tr td').eq(4).text().split(' ')
    for (counter,oneRuntime) in enumerate(runtimeMeta):
        runtime = int(oneRuntime.split(u'时')[0])*60+int(oneRuntime.split(u'时')[1][0:-1])
        result[counter].append(runtime)
    # Originating station
    startingStations=pq(e)('tr').eq(1)('td').eq(1).text().split(' ')
    for (counter,oneStartingStation) in enumerate(startingStations):
        result[counter].append(oneStartingStation)
    # Terminal station
    terminatingStations=pq(e)('tr').eq(1)('td').eq(3).text().split(' ')
    for (counter,oneTerminatingStation) in enumerate(terminatingStations):
        result[counter].append(oneTerminatingStation)
    # Departure time
    departureTimes=pq(e)('tr').eq(2)('td').eq(1).text().split(' ')
    for (counter,oneDepartureTime) in enumerate(departureTimes):
        result[counter].append(datetime.strptime(oneDepartureTime,'%H:%M'))
    # Arrival time
    arrivalTimes=pq(e)('tr').eq(2)('td').eq(3).text().split(' ')
    for (counter,oneArrivalTime) in enumerate(arrivalTimes):
        result[counter].append(datetime.strptime(oneArrivalTime,'%H:%M'))
    # Train type
    clazzes=pq(e)('tr').eq(3)('td').eq(1).text().split(' ')
    for (counter,oneClazz) in enumerate(clazzes):
        result[counter].append(oneClazz)
    # Total distance (the trailing two-character unit is stripped)
    ranges=pq(e)('tr').eq(3)('td').eq(3).text().split(' ')
    for (counter,oneRange) in enumerate(ranges):
        result[counter].append(int(oneRange[0:-2]))
    return result
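
The running-time and distance conversions in getTrainInfo are easy to try on their own. Here is a small sketch, assuming the cell text looks like `12时30分` and `1463公里`; both sample strings are invented here, inferred from how the code slices them, and `parse_runtime_minutes` and `parse_distance_km` are hypothetical names.

```python
# -*- coding: utf-8 -*-
def parse_runtime_minutes(text):
    """Convert text of the assumed form 'H时M分' into total minutes."""
    hours, rest = text.split(u'时')
    minutes = rest[:-1]  # drop the trailing one-character unit
    return int(hours) * 60 + int(minutes)


def parse_distance_km(text):
    """Convert text of the assumed form 'N公里' into an integer distance."""
    return int(text[:-2])  # drop the trailing two-character unit


runtime = parse_runtime_minutes(u'12时30分')
distance = parse_distance_km(u'1463公里')
```

Like the original slicing, both helpers raise ValueError when a cell does not match the expected shape, which is arguably what you want while the page layout is still being worked out.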

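Some special pages (1116A, for example) hold the data for two trains on a single page, so the function has to return a list that can carry more than one train's data.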

That is unusually nasty, but a nasty page calls for a nasty workaround...
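
To make the multi-train layout concrete: each cell holds one space-separated value per train, and getTrainInfo regroups those values by position. The same regrouping can be sketched with zip(). The cell values below are placeholders rather than real timetable data, and `split_per_train` is a hypothetical name.

```python
def split_per_train(cell_texts):
    """Turn per-field cells such as 'A1 A2' into one record per train."""
    columns = [text.split(' ') for text in cell_texts]
    # zip(*columns) groups the first value of every field, then the second, and so on.
    return [list(record) for record in zip(*columns)]


# Two trains sharing one page: train numbers, origins, destinations.
records = split_per_train(['T1 T2', 'OriginA OriginB', 'DestA DestB'])
# records[0] belongs to the first train, records[1] to the second.
```

One difference from the index-based loops is worth keeping in mind: zip() silently truncates to the shortest field, whereas result[counter] raises IndexError when a later field has more values than there are train numbers.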

Parsing the data for the stations along the route:

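        # Stations along the train's route
        global witchTrain # a reluctant choice: the map() callback has nowhere else to keep state
        witchTrain = 0 # reset before map() runs, because getScheduleInfo increments it
        # The selector is split across adjacent string literals; note the space that keeps
        # div.ResultContentLeftContent and div.ResultTrainCodeContent separate.
        trainSchedule = pq(pageContent)('body center div.ResultContent div.ResultContentLeft div.ResultContentLeftContent '
                                        'div.ResultTrainCodeContent table').eq(2)('tr')
        trainSchedules = trainSchedule.map(getScheduleInfo)
        for oneTrainSchedule in trainSchedules:
            insertTrainSchedule(trainInfos,oneTrainSchedule,c)
            conn.commit()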
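
The global is there because map() hands the callback only an index and an element, which leaves no obvious place for per-page state. One alternative, sketched here under the assumption that each row has already been reduced to a list of cell texts (no pyquery involved), is to keep the counter on a callable object. `TrainRowTagger` is a hypothetical name and not part of the original script.

```python
class TrainRowTagger(object):
    """Tags each data row with the 1-based index of the train it belongs to."""

    def __init__(self):
        self.which_train = 0

    def __call__(self, i, cells):
        # A header row ('No.') or an empty first cell starts the next train.
        if cells[0] in ('No.', ''):
            self.which_train += 1
            return None
        return [self.which_train] + cells


tagger = TrainRowTagger()  # a fresh instance per page resets the counter
rows = [['No.'], ['1', 'StationA'], ['2', 'StationB'], ['No.'], ['1', 'StationC']]
tagged = [r for r in (tagger(i, row) for i, row in enumerate(rows)) if r is not None]
```

Creating a new instance for every page replaces the manual reset, so the counter can no longer leak from one page into the next.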

Parsing one row of station data (getScheduleInfo):

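def getScheduleInfo(i,e):
    global witchTrain # index of the train on the current page that this row belongs to
    td = pq(e)('td')
    # A header row ("No.") or an empty first cell marks the start of the next train's table
    if td.eq(0).text() in ('No.',""):
        witchTrain += 1
        return
    # Malformed rows: log them and skip
    if len(td) == 2:
        logger.error("%s:%s"%(td.text().encode('gb18030'),len(td)))
        return
    # Stop time (stays at 00:00 when the cell cannot be parsed)
    stopTime = datetime.strptime("00:00",'%H:%M')
    try:
        stopTime = datetime.strptime(td.eq(5).text(),'%H:%M')
    except Exception:
        #print 'stop time parse error:%s:%s'%(td.eq(5).text(),td.eq(0).text())
        pass
    # Departure time (stays at 00:00 when the cell cannot be parsed)
    startTime = datetime.strptime("00:00",'%H:%M')
    try:
        startTime = datetime.strptime(td.eq(6).text(),'%H:%M')
    except Exception:
        #print 'start time parse error:%s:%s'%(td.eq(6).text(),td.eq(0).text())
        pass
    # Distance travelled (named distance to avoid shadowing the built-in range;
    # the trailing two-character unit is stripped)
    distance = int(td.eq(7).text()[:-2])
    # Prices: the trailing unit character is stripped; a missing cell leaves the default.
    # Hard seat
    hardSeatPrice=0.0
    if td.eq(8) and len(td.eq(8).text()) > 1:
        hardSeatPrice=td.eq(8).text()[:-1]
    # Hard sleeper, middle berth
    hardBerthPrice=0.0
    if td.eq(9) and len(td.eq(9).text()) > 1 and td.eq(9).text()[:-1] != "-":
        hardBerthPrice=td.eq(9).text()[:-1]
    # Soft seat
    softSeatPrice=0.0
    if td.eq(10) and len(td.eq(10).text()) > 1:
        softSeatPrice=td.eq(10).text()[:-1]
    # Soft sleeper, lower berth
    softBerthPrice=0.0
    if td.eq(11) and len(td.eq(11).text()) > 1:
        softBerthPrice=td.eq(11).text()[:-1] # was assigned to softBerth, so the parsed price was lost
    return [[witchTrain,
            int(td.eq(0).text()),td.eq(1).text(),
            td.eq(4).text(),stopTime,
            startTime,distance,
            hardSeatPrice,hardBerthPrice,
            softSeatPrice,softBerthPrice,]] # the extra list is needed: map() merges a returned list into its result item by item, so an unwrapped row would be flattened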
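
The extra list is needed because pyquery's map(), like jQuery's, flattens list results: a returned list is merged into the output item by item, and a None result is dropped. The sketch below mimics that behaviour in plain Python so that it runs without pyquery. `flattening_map` is a stand-in written for this example, not pyquery's actual implementation.

```python
def flattening_map(items, callback):
    """Mimic a jQuery-style map: lists are flattened one level, None is skipped."""
    results = []
    for i, item in enumerate(items):
        value = callback(i, item)
        if value is None:
            continue
        if isinstance(value, list):
            results.extend(value)
        else:
            results.append(value)
    return results


rows = [['1', 'StationA'], ['2', 'StationB']]
flat = flattening_map(rows, lambda i, row: row)      # rows merged into one flat list
nested = flattening_map(rows, lambda i, row: [row])  # one entry per row survives
```

The same rule explains why the bare return statements for header and malformed rows work: they yield None, so those rows simply do not appear in the result.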
