Is writing async crawlers too much trouble? Give Trio a try!

Posted by 竹叶青 on 2019/10/28 22:44:58


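Trio means a three-piece ensemble. It makes asynchronous programming more convenient. It is an alternative to asyncio with a friendlier, higher-level API, not a wrapper around it.

It aims to be simpler than the complex asyncio module: easier to use than asyncio and Twisted while being just as capable. The project is still young and somewhat experimental, but the overall design is solid. The authors encourage everyone to try it out, and if you run into problems you can open an issue on GitHub. There is also an online chat room for getting in touch: https://gitter.im/python-trio/general.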

Preparation

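Make sure your Python version is 3.5 or later (the minimum at the time of writing; newer Trio releases require a more recent Python).
Install trio: python3 -m pip install --upgrade trio
Run import trio and check for errors. If there are none, you are ready to continue.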

Background: async functions

Using Trio means you will be writing async functions all the time.

# A regular function
def regular_double(x):
    return 2 * x

# An async function
async def async_double(x):
    return 2 * x

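On the surface, an async function looks just like a regular one, except for the async in front of def.

"Async" is short for "asynchronous". To distinguish them from async functions, we call regular functions synchronous functions. From the user's point of view, the two differ as follows:

To call an async function, you must use the await keyword. So instead of regular_double(3), you write await async_double(3).
You cannot use await inside a synchronous function. Doing so is a syntax error: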

def print_double(x):
    print(await async_double(x))   # <-- SyntaxError here

Inside an async function, however, await is allowed:

async def print_double(x):
    print(await async_double(x))   # <-- OK!
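
The rule can be checked without running any async code. The following is a minimal sketch using only the built-in compile(); the two source strings and the helper compiles() are made up for this illustration and simply mirror the two print_double definitions above.

```python
# Hypothetical demo strings that mirror the two print_double versions above.
SYNC_SOURCE = "def print_double(x):\n    print(await async_double(x))\n"
ASYNC_SOURCE = "async def print_double(x):\n    print(await async_double(x))\n"


def compiles(source):
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<demo>", "exec")
    except SyntaxError:
        return False
    return True


sync_ok = compiles(SYNC_SOURCE)    # await inside a plain def
async_ok = compiles(ASYNC_SOURCE)  # await inside an async def

print("await inside def compiles:", sync_ok)
print("await inside async def compiles:", async_ok)
```

Because the check happens at compile time, the faulty version is rejected before a single line of it runs.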

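To sum up: from the user's perspective, the entire advantage of async functions over regular functions is that they have a superpower: they can call other async functions.

An async function can call other async functions, but the chain has to start somewhere. How is the very first async function called?

Let's read on.

How to call the first async function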

import trio

async def async_double(x):
    return 2 * x

trio.run(async_double, 3)  # returns 6

Here we use trio.run to call the first async function.
Next, let's look at what else Trio can do.

Waiting in async code

import trio

async def double_sleep(x):
    await trio.sleep(2 * x)

trio.run(double_sleep, 3)  # does nothing for 6 seconds then returns

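This uses the async sleep function trio.sleep. It does much the same job as the synchronous time.sleep(), but because it has to be called with await, we know from the conclusion above that it is an async function.

In fact this example has no practical use, since a synchronous function could do the same thing. Its main purpose is to demonstrate that an async function can call other async functions through await.

Typical structure of an async call chain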

trio.run -> [async function] -> ... -> [async function] -> trio.whatever

Don't forget await

What happens if you forget to write await? Consider the following example.

import time
import trio

async def broken_double_sleep(x):
    print("*yawn* Going to sleep")
    start_time = time.perf_counter()

    # Oops, I forgot the await
    trio.sleep(2 * x)

    sleep_time = time.perf_counter() - start_time
    print("Woke up after {:.2f} seconds, feeling well rested!".format(sleep_time))

trio.run(broken_double_sleep, 3)

Running it prints:

*yawn* Going to sleep
Woke up after 0.00 seconds, feeling well rested!
__main__:4: RuntimeWarning: coroutine 'sleep' was never awaited

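This is a warning rather than an exception: its category is RuntimeWarning, and the message says that the coroutine sleep was never awaited.

If we print trio.sleep(3), we see a coroutine object, which, as described above, is what calling an async function produces.

To fix the example, change trio.sleep(2 * x) to await trio.sleep(2 * x).

Remember: if you see the runtime warning "RuntimeWarning: coroutine '…' was never awaited", it means there is a place where you forgot to write await.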
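
To make the missing await more concrete, here is a minimal sketch using only the standard library. It shows that calling an async function merely creates a coroutine object and that the body runs only once something drives it. Calling send(None) by hand is done here purely for illustration; in real code await or trio.run does this for you.

```python
import inspect


async def async_double(x):
    return 2 * x


# Calling without await runs nothing yet; it only builds a coroutine object.
coro = async_double(3)
kind = type(coro).__name__
is_coroutine = inspect.iscoroutine(coro)
print(kind, is_coroutine)

# Drive the coroutine by hand. Because the body never pauses, it finishes
# on the first step and hands back its return value via StopIteration.
try:
    coro.send(None)
except StopIteration as stop:
    result = stop.value

print(result)
```

A coroutine object that is created and then thrown away without ever being driven is exactly what triggers the "was never awaited" warning.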

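Running multiple async functions

If all Trio could do were trivial examples like await trio.sleep, it would not be worth much. So let's look at another feature: running multiple async functions at the same time.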

# tasks-intro.py

import trio

async def child1():
    print("  child1: started! sleeping now...")
    await trio.sleep(1)
    print("  child1: exiting!")

async def child2():
    print("  child2: started! sleeping now...")
    await trio.sleep(1)
    print("  child2: exiting!")

async def parent():
    print("parent: started!")
    async with trio.open_nursery() as nursery:
        print("parent: spawning child1...")
        nursery.start_soon(child1)

        print("parent: spawning child2...")
        nursery.start_soon(child2)

        print("parent: waiting for children to finish...")
        # -- we exit the nursery block here --
    print("parent: all done!")

trio.run(parent)

There is a lot here, so let's go through it step by step. First, two async functions, child1 and child2, are defined in the same way as described above.

async def child1():
    print("child1: started! sleeping now...")
    await trio.sleep(1)
    print("child1: exiting!")

async def child2():
    print("child2: started! sleeping now...")
    await trio.sleep(1)
    print("child2: exiting!")

Next, we define parent as an async function that will run child1 and child2 concurrently:

async def parent():
    print("parent: started!")
    async with trio.open_nursery() as nursery:
        print("parent: spawning child1...")
        nursery.start_soon(child1)

        print("parent: spawning child2...")
        nursery.start_soon(child2)

        print("parent: waiting for children to finish...")
        # Leaving the block calls __aexit__, which waits for child1 and child2 to finish
    print("parent: all done!")

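It uses the mysterious async with statement to create a "nursery", and then adds child1 and child2 to the nursery through the nursery's start_soon method.

Now for async with, which is actually quite simple. When reading a file we write with open() ... to create a file handle, and with involves two magic methods:

__enter__() is called when the block starts and __exit__() when it ends. The object returned by open() is called a context manager. The statement async with someobj works much the same way, except that it calls the async magic methods __aenter__ and __aexit__. We call someobj an "async context manager".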

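Back to the code above. First, we use async with to create an async block.
Then nursery.start_soon(child1) and nursery.start_soon(child2) schedule child1 and child2 to start running and return immediately, leaving the two async functions running in the background.

The parent then waits at the end of the async with block until child1 and child2 have finished, and finally prints "parent: all done!".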

Let's look at the output:

parent: started!
parent: spawning child1...
parent: spawning child2...
parent: waiting for children to finish...
  child2: started! sleeping now...
  child1: started! sleeping now...
    [... 1 second passes ...]
  child1: exiting!
  child2: exiting!
parent: all done!

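This matches the analysis above. If you are familiar with threads, you will notice that this works much like multithreading. But these are not threads: all of this code runs in a single thread. To mark the difference, we call child1 and child2 tasks. With tasks, switching can only happen at certain designated places that we call "checkpoints". We will dig deeper into them later.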

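Tracing in Trio

We know that the tasks above are switched within a single thread, but we have not yet seen how the switching happens, and understanding that makes the library much easier to learn.
Fortunately, Trio provides a set of tools for inspecting and debugging programs. We can write a Tracer class that implements the trio.abc.Instrument interface. The code is as follows: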

class Tracer(trio.abc.Instrument):
    def before_run(self):
        print("!!! run started")

    def _print_with_task(self, msg, task):
        # repr(task) is perhaps more useful than task.name in general,
        # but in context of a tutorial the extra noise is unhelpful.
        print("{}: {}".format(msg, task.name))

    def task_spawned(self, task):
        self._print_with_task("### new task spawned", task)

    def task_scheduled(self, task):
        self._print_with_task("### task scheduled", task)

    def before_task_step(self, task):
        self._print_with_task(">>> about to run one step of task", task)

    def after_task_step(self, task):
        self._print_with_task("<<< task step finished", task)

    def task_exited(self, task):
        self._print_with_task("### task exited", task)

    def before_io_wait(self, timeout):
        if timeout:
            print("### waiting for I/O for up to {} seconds".format(timeout))
        else:
            print("### doing a quick check for I/O")
        self._sleep_time = trio.current_time()

    def after_io_wait(self, timeout):
        duration = trio.current_time() - self._sleep_time
        print("### finished I/O check (took {} seconds)".format(duration))

    def after_run(self):
        print("!!! run finished")

Then we run the earlier example again, this time passing in a Tracer object.

trio.run(parent, instruments=[Tracer()])

This prints a lot of output, which we will go through piece by piece.

!!! run started
### new task spawned: <init>
### task scheduled: <init>
### doing a quick check for I/O
### finished I/O check (took 1.787799919839017e-05 seconds)
>>> about to run one step of task: <init>
### new task spawned: __main__.parent
### task scheduled: __main__.parent
### new task spawned: <TrioToken.run_sync_soon task>
### task scheduled: <TrioToken.run_sync_soon task>
<<< task step finished: <init>
### doing a quick check for I/O
### finished I/O check (took 1.704399983282201e-05 seconds)

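We do not need to care about most of this output. Look at ### new task spawned: __main__.parent, which shows that a task was created to run __main__.parent.

Once the initial housekeeping is done, Trio starts running the parent function, and you can see parent creating the two child tasks. Then it reaches the end of the async with block and pauses.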

>>> about to run one step of task: __main__.parent
parent: started!
parent: spawning child1...
### new task spawned: __main__.child1
### task scheduled: __main__.child1
parent: spawning child2...
### new task spawned: __main__.child2
### task scheduled: __main__.child2
parent: waiting for children to finish...
<<< task step finished: __main__.parent

Control then returns to trio.run(), which logs a bit more of its internal work.

>>> about to run one step of task: <call soon task>
<<< task step finished: <call soon task>
### doing a quick check for I/O
### finished I/O check (took 5.476875230669975e-06 seconds)

Then the two child tasks each get a chance to run:

>>> about to run one step of task: __main__.child2
  child2: started! sleeping now...
<<< task step finished: __main__.child2

>>> about to run one step of task: __main__.child1
  child1: started! sleeping now...
<<< task step finished: __main__.child1

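Each task runs until it calls trio.sleep(), and then suddenly we are back in trio.run(), deciding what to run next. How does that happen? The secret is that trio.run() and trio.sleep() work together: trio.sleep() has access to some special magic that lets it pause the entire call stack, so it sends a note to trio.run() asking to be woken up again after 1 second, and then suspends the task. Once the task is suspended, Python hands control back to trio.run(), which decides what to do next.

Note: you cannot use asyncio.sleep() inside Trio, because asyncio primitives need asyncio's own event loop.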

Next, with nothing ready to run, it calls an operating system primitive to put the whole process to sleep:

### waiting for I/O for up to 0.9997810370005027 seconds

After the 1 second sleep is over:

### finished I/O check (took 1.0006483688484877 seconds)
### task scheduled: __main__.child1
### task scheduled: __main__.child2

Remember how parent is waiting for both child tasks to finish? Watch what happens to parent when child1 exits:

>>> about to run one step of task: __main__.child1
  child1: exiting!
### task scheduled: __main__.parent
### task exited: __main__.child1
<<< task step finished: __main__.child1

>>> about to run one step of task: __main__.child2
  child2: exiting!
### task exited: __main__.child2
<<< task step finished: __main__.child2

After another quick I/O check, the parent task wakes up and finishes:

### doing a quick check for I/O
### finished I/O check (took 9.045004844665527e-06 seconds)

>>> about to run one step of task: __main__.parent
parent: all done!
### task scheduled: <init>
### task exited: __main__.parent
<<< task step finished: __main__.parent

Finally, after some more internal housekeeping, the run ends:

### doing a quick check for I/O
### finished I/O check (took 5.996786057949066e-06 seconds)
>>> about to run one step of task: <init>
### task scheduled: <call soon task>
### task scheduled: <init>
<<< task step finished: <init>
### doing a quick check for I/O
### finished I/O check (took 6.258022040128708e-06 seconds)
>>> about to run one step of task: <call soon task>
### task exited: <call soon task>
<<< task step finished: <call soon task>
>>> about to run one step of task: <init>
### task exited: <init>
<<< task step finished: <init>
!!! run finished

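OK, this part mainly covered the internal mechanics. A rough understanding is enough, although remembering the details makes Trio easier to reason about.
More on using Trio is coming, so stay tuned.

References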

https://trio.readthedocs.io/en/latest/tutorial.html#before-you-begin

Repost notice: this article is reposted from the WeChat official account 【进击的Coder】.

Original link: https://mp.weixin.qq.com/s/b9ApWrG6Y5qvT3s7GhXRDA
