With asyncio and aiohttp in Python, are multithreading and multiprocessing still necessary for IO-bound tasks such as web crawlers? - delphi2007 - ChinaUnix blog

Code

asyncio+aiohttp

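import asyncio
import time

import aiohttp

NUMBERS = range(12)
URL = '{}'


async def fetch_async(a):
    # Issue one GET request and pull the echoed argument out of the JSON body.
    async with aiohttp.request('GET', URL.format(a)) as r:
        data = await r.json()
    return data['args']['a']


async def main():
    # Schedule all requests at once on a single event loop.
    return await asyncio.gather(*(fetch_async(num) for num in NUMBERS))


start = time.time()
results = asyncio.run(main())

for num, result in zip(NUMBERS, results):
    print('fetch({}) = {}'.format(num, result))

print('Use asyncio+aiohttp cost: {}'.format(time.time() - start))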

asyncio + aiohttp + thread pool: about 1 second slower than the version above

import asyncio
import math
import time
from concurrent.futures import ThreadPoolExecutor

import aiohttp

NUMBERS = range(12)
URL = '{}'


async def fetch_async(a):
    async with aiohttp.request('GET', URL.format(a)) as r:
        data = await r.json()
    return a, data['args']['a']


def sub_loop(numbers):
    # Runs inside a worker thread: each thread gets its own event loop.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        tasks = [fetch_async(num) for num in numbers]
        results = loop.run_until_complete(asyncio.gather(*tasks))
    finally:
        loop.close()
    for num, result in results:
        print('fetch({}) = {}'.format(num, result))


async def run(executor, numbers):
    # Hand one chunk to the thread pool and wait for it without blocking.
    await asyncio.get_running_loop().run_in_executor(executor, sub_loop, numbers)


def chunks(l, size):
    # Split l into at most `size` chunks of ceil(len(l) / size) items each.
    n = math.ceil(len(l) / size)
    for i in range(0, len(l), n):
        yield l[i:i + n]


async def main(executor):
    await asyncio.gather(*(run(executor, chunked) for chunked in chunks(NUMBERS, 3)))


start = time.time()
with ThreadPoolExecutor(max_workers=3) as executor:
    asyncio.run(main(executor))

print('Use asyncio+aiohttp+ThreadPoolExecutor cost: {}'.format(time.time() - start))
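
One detail of the chunks helper above is easy to misread: its second argument behaves like the number of chunks, not the chunk length. The short sketch below copies the helper and runs it on two plain lists (the inputs are chosen only for illustration) to show how the work would be divided among three threads.

```python
import math


def chunks(l, size):
    # Same logic as the helper above: n is the length of each chunk,
    # so `size` ends up being the (maximum) number of chunks.
    n = math.ceil(len(l) / size)
    for i in range(0, len(l), n):
        yield l[i:i + n]


even = list(chunks(list(range(12)), 3))
uneven = list(chunks(list(range(10)), 3))
print(even)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(uneven)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

So with 12 numbers and a pool of 3 workers, each thread's event loop handles 4 requests concurrently.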

Traditional requests + ThreadPoolExecutor: about 3 times slower than the version above

import time
import requests
from concurrent.futures import ThreadPoolExecutor

NUMBERS = range(12)
URL = '{}'

def fetch(a):
    r = requests.get(URL.format(a))
    return r.json()['args']['a']

start = time.time()
with ThreadPoolExecutor(max_workers=3) as executor:
    for num, result in zip(NUMBERS, executor.map(fetch, NUMBERS)):
        print('fetch({}) = {}'.format(num, result))

print('Use requests+ThreadPoolExecutor cost: {}'.format(time.time() - start))
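
A likely reason for the gap is the pool size rather than threads as such: with max_workers=3 the 12 requests run in four batches, while asyncio.gather starts all 12 at once. The sketch below reproduces that effect without any network access. The sleeps stand in for the HTTP calls, and DELAY, fake_fetch and fake_fetch_async are hypothetical names introduced only for this illustration.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.1          # assumed per-request latency in seconds (hypothetical)
NUMBERS = range(12)


async def fake_fetch_async(a):
    await asyncio.sleep(DELAY)   # non-blocking wait, stands in for aiohttp
    return a


def fake_fetch(a):
    time.sleep(DELAY)            # blocking wait, stands in for requests.get
    return a


async def gather_all():
    # All 12 simulated requests wait concurrently on one event loop.
    return await asyncio.gather(*(fake_fetch_async(n) for n in NUMBERS))


def time_async():
    start = time.perf_counter()
    asyncio.run(gather_all())
    return time.perf_counter() - start


def time_threads(max_workers=3):
    # Only max_workers simulated requests can wait at the same time.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(fake_fetch, NUMBERS))
    return time.perf_counter() - start


async_cost = time_async()
thread_cost = time_threads()
print('simulated asyncio cost: {:.2f}s'.format(async_cost))
print('simulated 3-thread pool cost: {:.2f}s'.format(thread_cost))
```

In this simulation, calling time_threads(max_workers=12) brings the thread pool down to roughly one delay as well, so the measured difference reflects how many requests are in flight at once more than threads versus coroutines.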

Addendum

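The question above assumes CPython. Answers along the lines of "I like multithreading and dislike the coroutine style" are clearly outside the scope of this question. What I mainly want to ask is:
If Python cannot get rid of the GIL, I think the ideal model in the future is multiple processes + coroutines (asyncio + aiohttp). Some projects, as well as a crawler project in 500lines, already work this way. Setting compatibility issues aside, is this view correct, and in which scenarios can coroutines not replace multithreading?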
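
The multi-process + coroutine model mentioned above could be sketched roughly as follows: each worker process runs its own event loop over one slice of the work. This is only an illustration under stated assumptions, not how any particular project does it. fake_fetch simulates a request with a sleep instead of calling aiohttp, and split, crawl_chunk and run_multiprocess are hypothetical helper names.

```python
import asyncio
import math
from concurrent.futures import ProcessPoolExecutor


async def fake_fetch(num):
    # Stand-in for an aiohttp request; sleeps instead of touching the network.
    await asyncio.sleep(0.05)
    return num, num * 2


async def crawl(numbers):
    # One event loop per process drives all requests of its chunk concurrently.
    return await asyncio.gather(*(fake_fetch(n) for n in numbers))


def crawl_chunk(numbers):
    # Entry point executed inside each worker process.
    return asyncio.run(crawl(numbers))


def split(items, parts):
    # Split items into at most `parts` contiguous chunks.
    step = max(1, math.ceil(len(items) / parts))
    return [items[i:i + step] for i in range(0, len(items), step)]


def run_multiprocess(numbers, processes=3):
    # Fan the chunks out to worker processes and flatten the results in order.
    results = []
    with ProcessPoolExecutor(max_workers=processes) as pool:
        for chunk_result in pool.map(crawl_chunk, split(numbers, processes)):
            results.extend(chunk_result)
    return results
```

A script would call run_multiprocess(list(range(12))) under an `if __name__ == '__main__':` guard, which process pools need on platforms that spawn workers. The processes spread CPU-bound work such as parsing across cores, while the coroutines inside each process overlap the network waits.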

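There are many approaches to asynchronous IO, and twisted, tornado, and others each have their own solution. This question is specifically about coroutine-based async with asyncio + aiohttp.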

One more question


This answer describes it quite clearly:
