Python Study Notes on Web Scraping (5): Processes, Threads, and Coroutines in Practice | 【生长吧！Python】

Posted by 菜鸟级攻城狮 on 2021/07/06 21:43:48

[Abstract] Python study notes on web scraping (5): processes, threads, and coroutines in practice

''' Async scraping in practice: downloading a novel '''

# http://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"4306063500"} => all chapters (title and cid)
# Content of a single chapter:
# http://dushu.baidu.com/api/pc/getChapterContent?data={"book_id":"4306063500","cid":"4306063500|11349571","need_bookinfo":1}

import requests
import asyncio
import aiohttp
import aiofiles
import json

'''
1. Synchronous step: request getCatalog to get every chapter's title and cid
2. Asynchronous step: request getChapterContent to download the content of every chapter
'''

async def aiodownload(b_id, cid, title):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36 Edg/91.0.864.41'
    }
    data = {
        "book_id": b_id,
        "cid": f"{b_id}|{cid}",
        "need_bookinfo": 1
    }
    # serialize the dict to a JSON-formatted string
    data = json.dumps(data)
    url = f'http://dushu.baidu.com/api/pc/getChapterContent?data={data}'

    # send the request asynchronously
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as resp:
            dct = await resp.json()

            # save the content to a file (the novel/ directory must already exist)
            async with aiofiles.open('novel/' + title, 'w', encoding='utf-8') as f:
                await f.write(dct['data']['novel']['content']) # write out the chapter text

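async def getCatalog(url):
    resp = requests.get(url)    # synchronous: nothing below runs until this request completes
    dct = resp.json()
    tasks = []
    for item in dct['data']['novel']['items']: # each item holds one chapter's title and cid
        title = item['title']
        cid = item['cid']
        # queue one download coroutine per chapter (b_id is the module-level global set under __main__)
        tasks.append(aiodownload(b_id, cid, title))
    # asyncio.wait() no longer accepts bare coroutines (removed in Python 3.11), so gather() is used
    await asyncio.gather(*tasks)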
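
getCatalog above starts one download per chapter all at once. For a long book that can mean hundreds of simultaneous requests, so it may be worth capping the concurrency. The following is a minimal sketch using asyncio.Semaphore; run_bounded, fake_download, and demo are hypothetical names introduced only for illustration, and fake_download stands in for aiodownload so the sketch runs without network access.

```python
import asyncio

async def run_bounded(coros, limit=10):
    """Run coroutines concurrently, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        # each coroutine waits here until one of the `limit` slots is free
        async with semaphore:
            return await coro

    # gather() returns results in the same order as the input coroutines
    return await asyncio.gather(*(guarded(c) for c in coros))

async def fake_download(i, state):
    # stand-in for aiodownload(): records how many "downloads" overlap
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)
    state["active"] -= 1
    return i

def demo(n=20, limit=5):
    state = {"active": 0, "peak": 0}
    coros = [fake_download(i, state) for i in range(n)]
    results = asyncio.run(run_bounded(coros, limit))
    return results, state["peak"]
```

In the script above, the same idea would amount to passing the list of aiodownload(...) coroutines to run_bounded instead of awaiting them all directly; the right limit depends on the site and is not something the original notes specify.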

if __name__ == '__main__':
    b_id = '4306063500'
    # Braces are special inside an f-string (literal braces would have to be doubled as {{ }}), so plain concatenation is used here
    url = 'http://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"' + b_id + '"}'
    asyncio.run(getCatalog(url))
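
The catalog URL above is assembled by string concatenation to avoid the f-string brace issue. Two alternatives are sketched below: doubling the braces inside an f-string, and letting json.dumps produce the JSON text. The function names and the optional percent-encoding are assumptions added for illustration; whether the server accepts the percent-encoded form has not been checked here.

```python
import json
from urllib.parse import quote

BASE = "http://dushu.baidu.com/api/pc"

def catalog_url_fstring(book_id):
    # inside an f-string, {{ and }} produce literal { and }
    return f'{BASE}/getCatalog?data={{"book_id":"{book_id}"}}'

def catalog_url_json(book_id, encode=False):
    # compact separators give the same spacing as the hand-written URL
    data = json.dumps({"book_id": book_id}, separators=(",", ":"))
    if encode:
        # percent-encode the JSON so the URL contains no raw quotes or braces
        data = quote(data)
    return f"{BASE}/getCatalog?data={data}"
```

Both helpers return the same string as the concatenation used in the script when encode is left at its default.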

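''' Combined exercise: how video sites work '''

# A file is needed to record: 1. the playback order of the video segments 2. the paths where the segments are stored
# m3u8, txt, json ==> all plain text

# To grab a video:
# 1. Find the m3u8 (by whatever means)
# 2. Download the ts files listed in the m3u8
# 3. Merge the ts files into a single mp4 file (by whatever means, not only programming)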

''' Scraping 91看剧, simple version: getting familiar with the m3u8 structure '''
'''

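Reference articles:
    1. Downloading and merging ts video via an m3u8 file with Python
        https://blog.csdn.net/weixin_38819889/article/details/103434122
    2. Downloading ts video via an m3u8 file with a Python crawler
        https://zhuanlan.zhihu.com/p/70290764

What is an m3u8 file:
    m3u8 is the index (playlist) format of the video streaming standard introduced by Apple (HLS). The video is cut
    into many short ts segments that are stored on the server (nowadays often in server memory, to reduce I/O access).
    The client parses the segment paths out of the m3u8 and then requests them one by one.
    It is a popular loading method today; many news and video sites load their video this way.

    An M3U8 file is an M3U file encoded in UTF-8. An M3U file is a plain-text index: a player that opens it does not
    play the file itself, but follows the index to the network addresses of the audio/video files and plays those online.
    The original video is split into many TS streams, and the address of each TS stream is listed in the m3u8 file.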

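m3u8 tags and attributes:
    #EXTM3U
        Must be the first line of every M3U file; it identifies the file as a playlist
    #EXT-X-VERSION:3
        Optional
    #EXT-X-MEDIA-SEQUENCE:140651513
        Every media URI in the playlist has a unique sequence number, and adjacent URIs
        differ by 1. The tag itself is optional; without it, numbering starts at 0
    #EXT-X-ALLOW-CACHE:NO
        Whether the client may cache downloaded segments for later playback. It may appear
        anywhere in the playlist, at most once, and applies to every media segment.
        Format: #EXT-X-ALLOW-CACHE:<YES|NO>
    #EXT-X-TARGETDURATION
        The maximum media segment duration in seconds, so the duration given in each #EXTINF
        must be less than or equal to this value. The tag may appear only once in a playlist
        (with nested playlists, it normally appears only in the m3u8 that lists the actual
        ts URLs)
    #EXT-X-PLAYLIST-TYPE
        Optional; describes how the playlist may change and applies to the whole file.
        Format: #EXT-X-PLAYLIST-TYPE:<EVENT|VOD>. With VOD, the server must not change the
        playlist. With EVENT, the server must not change or delete any part of the playlist,
        but it may append new lines to it.
    #EXTINF
        Format: #EXTINF:<duration>,<title>. duration is the length of one media segment (ts)
        in seconds and applies only to the URI that follows. title is an optional
        human-readable label; the URL of the segment is on the next line
    #EXT-X-KEY
        Describes how to decrypt media segments. It applies to every media URI until the next
        #EXT-X-KEY tag. METHOD is NONE or AES-128. With NONE, the URI and IV (Initialization
        Vector) attributes must be absent. With AES-128 (Advanced Encryption Standard), URI
        is required and IV is optional.
    #EXT-X-PROGRAM-DATE-TIME
        Associates an absolute date and time with the first sample of a media segment;
        applies only to the next media URI.
        For example: #EXT-X-PROGRAM-DATE-TIME:2010-02-19T14:54:23.031+08:00
    #EXT-X-ENDLIST
        Marks the end of the playlist. It may appear anywhere in the playlist, but only once.
        Format: #EXT-X-ENDLIST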

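Workflow:
    1. Get the page source of 54812-1-1.html
    2. Extract the m3u8 URL from the source
    3. Download the m3u8
    4. Read the m3u8 file and download the video segments
    5. Merge the segments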
'''
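
The tag descriptions above can be turned into a small parser. The sketch below is a simplified illustration, not a complete HLS implementation: it reads only #EXTM3U and #EXTINF, skips every other tag, and resolves relative segment URIs against the playlist URL with urljoin. The function name, the SAMPLE playlist, and the example.com URLs are made up for the demonstration.

```python
from urllib.parse import urljoin

def parse_media_playlist(text, base_url):
    """Return (duration_seconds, absolute_url) pairs in playback order."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#EXTM3U":
        raise ValueError("not an m3u8 playlist: first line must be #EXTM3U")
    segments = []
    duration = None
    for line in lines[1:]:
        if line.startswith("#EXTINF:"):
            # "#EXTINF:<duration>,<title>" -> keep the number before the comma
            duration = float(line[len("#EXTINF:"):].split(",", 1)[0])
        elif line.startswith("#"):
            continue  # other tags are ignored in this sketch
        else:
            # a non-tag line is a segment URI, possibly relative to the playlist
            segments.append((duration, urljoin(base_url, line)))
            duration = None
    return segments

# hypothetical playlist used only to exercise the parser
SAMPLE = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:9.5,
seg0.ts
#EXTINF:4.0,
https://cdn.example.com/seg1.ts
#EXT-X-ENDLIST
"""
```

Calling parse_media_playlist(SAMPLE, "https://example.com/video/hls/index.m3u8") yields the two segments in order, with the relative seg0.ts resolved next to the playlist.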
import requests
import re
import os

url = 'https://91kanju.com/vod-play/54812-1-1.html'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36 Edg/91.0.864.41'
}

resp = requests.get(url, headers=headers)

pattern = re.compile(r"url: '(?P<url>.*?)',", re.S)
m3u8_url = pattern.search(resp.text).group('url')   # the address of the m3u8 file
# print(m3u8_url)
resp.close()

# download the m3u8 file
resp2 = requests.get(m3u8_url, headers=headers)
os.makedirs('video', exist_ok=True)     # create the output directory if it does not exist
with open('video/哲仁王后.m3u8', 'wb') as f:
    f.write(resp2.content)
resp2.close()
print('Download finished')
# Once the m3u8 has been downloaded, everything above can be commented out

# parse the m3u8 file
i = 1
with open('video/哲仁王后.m3u8', 'r', encoding='utf-8') as f:
    # print(f)    # <_io.TextIOWrapper name='video/哲仁王后.m3u8' mode='r' encoding='utf-8'>
    for line in f:
        line = line.strip()         # strip spaces, other whitespace, and the newline
        if not line or line.startswith('#'):    # skip blank lines and lines starting with #
            continue
        # print(line)

        # download one video segment
        with requests.get(line, headers=headers) as resp3:
            with open(f'video/{i}.ts', 'wb') as f2:
                f2.write(resp3.content)

        print(f'{line} {i} done')
        i += 1
        # The tutorial video plays the downloaded segments with QuickTime Player;
        # the default Windows player can play them too
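
The simple version stops after downloading; step 5 (merging) is not implemented. One detail matters if the merge later collects the files by listing the directory: names such as 1.ts, 2.ts, 10.ts sort incorrectly as plain text. A minimal sketch of a numeric sort follows; ordered_segments is a hypothetical helper, and it assumes the N.ts naming used by the loop above.

```python
import re

def ordered_segments(names):
    """Sort names like '10.ts' by their numeric stem rather than as text."""
    # keep only files named <digits>.ts, ignoring anything else in the folder
    numbered = [n for n in names if re.fullmatch(r"\d+\.ts", n)]
    return sorted(numbered, key=lambda n: int(n[:-3]))
```

With a plain sorted() call, '10.ts' would land before '2.ts', which would scramble the merged video.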

''' 91看剧, complex version '''
'''

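Approach:
    1. Get the source of the main page and find the URL of the iframe
    2. Get the address of the m3u8 file from the iframe's page source
    3. Download the first-level m3u8 file -> download the second-level m3u8 file (the segment paths)
    4. Download the video segments
    5. Download the key and decrypt the segments
    6. Merge all ts files into a single mp4 file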
'''
import requests
from bs4 import BeautifulSoup
import re
import asyncio
import aiohttp
import aiofiles
import os
from Cryptodome.Cipher import AES
'''
Official documentation:
    Home: https://www.pycryptodome.org/en/latest/
    AES: https://www.pycryptodome.org/en/latest/src/examples.html#encrypt-data-with-aes

Installing and using an AES (Advanced Encryption Standard) library in Python
    https://blog.csdn.net/m0_52693073/article/details/110943828

Crypto could not be used on Python 3.8, so pycryptodomex is used instead
Install pycryptodomex:
    pip install pycryptodomex
Usage:
    from Cryptodome.Cipher import AES
    aes = AES.new(key, AES.MODE_CBC, iv=iv)    # key and iv are bytes
    content = aes.decrypt(content)
An IV given as a hex string like the following raises an error:
    '0x674B2D91C06186399C33DC5901307461'
The error:
    ValueError: Incorrect IV length (it must be 16 bytes long)
Fix: convert the hex string to bytes:
    from binascii import unhexlify
    iv = unhexlify(IV_str.replace('0x', ''))
Result:
    unhexlify('674B2D91C06186399C33DC5901307461')
    b'gK-\x91\xc0a\x869\x9c3\xdcY\x010ta'
'''
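
The hex-to-bytes conversion described above can also be done with the standard library alone. The sketch below uses bytes.fromhex and checks the length up front, so a malformed IV fails early with a clear message. parse_iv and default_iv are hypothetical helper names. default_iv reflects the HLS rule that a segment's media sequence number, written as a 16-byte big-endian integer, serves as the IV when the #EXT-X-KEY tag has no IV attribute; the script in these notes uses a fixed IV instead.

```python
def parse_iv(iv_attr):
    """Convert an IV such as '0x674B...' from an m3u8 file into 16 raw bytes."""
    s = iv_attr.strip()
    if s[:2].lower() == "0x":
        s = s[2:]  # drop the 0x / 0X prefix
    iv = bytes.fromhex(s)
    if len(iv) != 16:
        raise ValueError(f"IV must be 16 bytes, got {len(iv)}")
    return iv

def default_iv(media_sequence):
    # big-endian 128-bit representation of the segment's sequence number
    return media_sequence.to_bytes(16, "big")
```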

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36 Edg/91.0.864.41'
}

def get_iframe_src(url):
    resp = requests.get(url, headers=headers)
    # get the src of the iframe tag; the page has only one iframe, so BeautifulSoup is the best fit
    main_page = BeautifulSoup(resp.text, 'html.parser')
    src = main_page.find('iframe').get('src')
    # print(src)    # https://boba.52kuyun.com/share/xfPs9NPHvYGhNzFp
    return src

def get_first_meu8_url(url):
    resp = requests.get(url, headers=headers)
    # a regex is the simplest way to pull a value out of a <script> block
    obj = re.compile(r'var main = "(?P<m3u8_url>.*?)"', re.S)
    m3u8_url = obj.search(resp.text).group('m3u8_url')
    # print(m3u8_url)   # /20170907/Moh2l9zV/index.m3u8?sign=548ae366a075f0fie7c76af215aa18e1
    return m3u8_url

def download_m3u8_file(url, file_name):
    resp = requests.get(url, headers=headers)
    with open(file_name, 'wb') as f:
        f.write(resp.content)

# Video tutorial: https://www.bilibili.com/video/BV1Mf4y1s7ds?p=76
async def download_ts(ts_url, name, session):
    async with session.get(ts_url, headers=headers) as resp:
        async with aiofiles.open(f'TSVideo/{name}', 'wb') as f:
            # write the downloaded bytes to the file
            # sending the request and writing the file are both asynchronous operations
            # without await: Coroutine 'write' is not awaited
            await f.write(await resp.content.read())
    # progress message once the segment is saved
    print(f'{name} downloaded!')

async def aio_download(prefix_url): # https://boba.52kuyun.com//20170907/Moh2l9zV/hls
    tasks = []
    # create the session up front and pass it to download_ts

    async with aiohttp.ClientSession() as session:
        async with aiofiles.open('越狱第一季第一集_second_m3u8_url.txt', 'r', encoding='utf-8') as f:
            # without `async` before `for`, f is flagged: Expected type 'collections.Iterable', got 'AsyncTextIOWrapper' instead
            async for line in f:
                line = line.strip() # strip whitespace and the newline
                if not line or line.startswith('#'):
                    continue
                # line is now xxxxxx.ts
                # build the full ts URL
                ts_url = prefix_url + line
                # wrap the coroutine in a task with asyncio.create_task
                task = asyncio.create_task(download_ts(ts_url, line, session))
                # collect every task in the tasks list
                tasks.append(task)

            # must come after the for loop has finished
            await asyncio.wait(tasks)   # wait for all tasks to finish
            '''
            async def wait(fs, *, loop=None, timeout=None, return_when=ALL_COMPLETED):
                Wait for the Futures and coroutines given by fs to complete.
                The sequence futures must not be empty.
                Coroutines will be wrapped in Tasks.
                Returns two sets of Future: (done, pending).
                    done: the futures that have finished; pending: the ones still outstanding
                Usage:
                    done, pending = await asyncio.wait(fs)

                Note: This does not raise TimeoutError! Futures that aren't done
                when the timeout occurs are returned in the second set.
            '''
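
As the asyncio.wait docstring quoted above says, the call returns two sets, (done, pending). One consequence is that an exception raised inside a task is not re-raised by wait(), so a failed segment download can pass unnoticed unless each task is inspected. The sketch below shows one way to do that; job and run_all are hypothetical names, and job stands in for download_ts so the example runs offline.

```python
import asyncio

async def job(i):
    # stand-in for download_ts(): every third "segment" fails
    await asyncio.sleep(0)
    if i % 3 == 0:
        raise RuntimeError(f"segment {i} failed")
    return i

async def run_all(n):
    tasks = [asyncio.create_task(job(i)) for i in range(n)]
    done, pending = await asyncio.wait(tasks)   # does not raise on task errors
    ok, failed = [], []
    # iterate over the task list (not the unordered `done` set) to keep order
    for i, task in enumerate(tasks):
        if task.exception() is None:
            ok.append(task.result())
        else:
            failed.append(i)
    return ok, failed, len(pending)
```

The failed indices could then be retried, which matters here because a single missing segment leaves a gap in the merged video.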

def get_key(url):
    resp = requests.get(url, headers=headers)
    return resp.content     # AES.new() needs the key as bytes, not str

'''
AES usage:
    from Cryptodome.Cipher import AES
    aes = AES.new(key, AES.MODE_CBC, iv=iv)    # key and iv are bytes
    content = aes.decrypt(content)
'''
async def dec_ts(file_name, key):
    # AES.new(key, mode, iv, IV, nonce, segment_size, mac_len, assoc_len, initial_value, counter, use_aesni)
    aes = AES.new(key, AES.MODE_CBC, IV=b'0000000000000000')# the IV must be 16 bytes long
    async with aiofiles.open(f'TSVideo/{file_name}', 'rb') as f1,\
        aiofiles.open(f'TSVideo/temp_{file_name}', 'wb') as f2: # \ continues the statement on the next line
        read = await f1.read() # read the encrypted segment
        await f2.write(aes.decrypt(read))   # write the decrypted bytes to the new file
    print(f'{file_name} processed!')

# Decryption (the first 8 minutes of the video are missing): https://www.bilibili.com/video/BV1Mf4y1s7ds?p=77
async def aio_dec(key):
    tasks = []
    async with aiofiles.open('越狱第一季第一集_second_m3u8_url.txt', 'r', encoding='utf-8') as f:
        async for line in f:
            line = line.strip() # strip whitespace and the newline
            if not line or line.startswith('#'):
                continue
            # create one decryption task per segment
            task = asyncio.create_task(dec_ts(line, key))
            tasks.append(task)

        await asyncio.wait(tasks)

# Merging into an mp4 file: https://www.bilibili.com/video/BV1Mf4y1s7ds?p=78
def merge_ts():
    # Commands for merging ts files:
    # mac: cat 1.ts 2.ts 3.ts > xxx.mp4
    # windows: copy /b 1.ts+2.ts+3.ts xxx.mp4

    # collect every decrypted ts file, in playlist order, for the merge
    lst = []
    with open('越狱第一季第一集_second_m3u8_url.txt', 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            lst.append(f'TSVideo/temp_{line}')
    # join the names
    s = '+'.join(lst)   # gives 1.ts+2.ts+3.ts
    # run the operating-system command that merges the video (Windows only)
    os.system(f'copy /b {s} movie.mp4')
    print('Video merged successfully!')
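
merge_ts above shells out to the Windows copy /b command, which ties the script to one platform and puts every file name on a single command line. Since both copy /b and cat simply concatenate bytes, the same result can be sketched in pure Python. merge_files and demo are hypothetical names; the demo writes three tiny placeholder files to a temporary directory instead of real segments.

```python
import os
import shutil
import tempfile

def merge_files(paths, out_path):
    """Concatenate the files in `paths`, in order, into `out_path`."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # stream in chunks so large segments are never fully in memory
                shutil.copyfileobj(src, out)
    return out_path

def demo():
    # three placeholder "segments" stand in for the decrypted ts files
    tmp = tempfile.mkdtemp()
    parts = []
    for i, chunk in enumerate([b"AAA", b"BB", b"C"]):
        path = os.path.join(tmp, f"temp_{i}.ts")
        with open(path, "wb") as fh:
            fh.write(chunk)
        parts.append(path)
    merged = merge_files(parts, os.path.join(tmp, "movie.mp4"))
    with open(merged, "rb") as fh:
        return fh.read()
```

In merge_ts, the list lst built from the playlist could be passed straight to merge_files. Note that byte concatenation leaves the data as an MPEG-TS stream whatever the output extension; remuxing into a real MP4 container would need a tool such as ffmpeg.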

def main(url):
    # 1. Get the source of the main page and find the URL of the iframe
    iframe_src = get_iframe_src(url)

    # 2. Get the address of the first-level m3u8 file
    first_m3u8_url = get_first_meu8_url(iframe_src)
    # take the domain of the iframe from https://boba.52kuyun.com/share/xfPs9NpHvYGhNzFp
    iframe_domain = iframe_src.split('/share')[0]
    # build the real download URL of the m3u8
    first_m3u8_url = iframe_domain + first_m3u8_url
    # print(first_m3u8_url)
    # https://boba.52kuyun.com//20170907/Moh2l9zV/index.m3u8?sign=548ae366a075f0fie7c76af215aa18e1

    # 3.1 Download the first-level m3u8 file
    download_m3u8_file(first_m3u8_url, '越狱第一季第一集_first_m3u8_url.txt')
    print('First-level m3u8 downloaded!')

    # 3.2 Download the second-level m3u8 file
    with open('越狱第一季第一集_first_m3u8_url.txt', 'r', encoding='utf-8') as f:
        for line in f:
            if line.startswith('#'):
                continue
            else:
                line = line.strip() # strip whitespace and the newline

                # build the download path of the second-level m3u8
                # https://boba.52kuyun.com//20170907/Moh2l9zV/ + hls/index.m3u8
                second_m3u8_url = first_m3u8_url.split('index.m3u8')[0] + line
                # print(second_m3u8_url)
                # https://boba.52kuyun.com//20170907/Moh2l9zV/hls/index.m3u8
                # download path of a ts file:
                # https://boba.52kuyun.com//20170907/Moh2l9zV/hls/cFN803436000.ts

                # download it
                download_m3u8_file(second_m3u8_url, '越狱第一季第一集_second_m3u8_url.txt')
                print('Second-level m3u8 downloaded!')

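    # 4. Download the video segments. Video tutorial: https://www.bilibili.com/video/BV1Mf4y1s7ds?p=75
    second_m3u8_url_prefix = second_m3u8_url.replace('index.m3u8', '')
    # async coroutines fit here: the work is I/O-heavy and there are many tasks
    asyncio.run(aio_download(second_m3u8_url_prefix))

    # 5.1 Get the key
    key_url = second_m3u8_url_prefix + 'key.key'    # shortcut; the proper way is to read the key URI from the m3u8 file
    key = get_key(key_url)

    # 5.2 Decrypt
    asyncio.run(aio_dec(key))

    # 6. Merge the ts files into an mp4 file
    merge_ts()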

if __name__ == '__main__':
    url = 'https://www.91kanju.com/vod-play/541-2-1.html'
    main(url)

    '''
    https://www.bilibili.com/video/BV1Mf4y1s7ds?p=79 02:31
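    Make simple problems complex, and complex problems simple
    Making a complex problem simple:
        How do we work through requirements this complex one step at a time?
        First get the approach straight: list everything that has to be done, then break the items down one by one.
        New ideas may come up while you are coding; add them to the original outline and keep solving item by item.
    Making a simple problem complex:
        Why has your programming ability stalled? After several years you are still writing CRUD.
        Push a simple problem to its limit; take flash sales (秒杀) as an example,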
    '''
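
main() above builds the key URL by appending key.key to the playlist prefix and notes that the proper way is to look it up in the m3u8 file. A minimal sketch of that lookup follows. It reads the first #EXT-X-KEY line, pulls out METHOD, URI, and IV with regular expressions, and resolves the URI against the playlist URL. find_key and the example.com URL in the usage note are hypothetical, and the sketch ignores playlists that switch keys partway through.

```python
import re
from urllib.parse import urljoin

def find_key(playlist_text, playlist_url):
    """Return the METHOD, absolute URI, and IV of the first #EXT-X-KEY tag."""
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line.startswith("#EXT-X-KEY:"):
            continue
        attrs = line[len("#EXT-X-KEY:"):]
        method = re.search(r"METHOD=([^,]+)", attrs)
        uri = re.search(r'URI="([^"]*)"', attrs)
        iv = re.search(r"IV=(0[xX][0-9A-Fa-f]+)", attrs)
        return {
            "method": method.group(1) if method else None,
            # the key URI may be relative to the playlist, like segment URIs
            "uri": urljoin(playlist_url, uri.group(1)) if uri else None,
            "iv": iv.group(1) if iv else None,
        }
    return None  # no #EXT-X-KEY tag: the segments are not encrypted
```

For a playlist at https://example.com/hls/index.m3u8 containing #EXT-X-KEY:METHOD=AES-128,URI="key.key", the returned uri is https://example.com/hls/key.key, which could then be passed to get_key.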

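[Notice] This content comes from a blogger in the Huawei Cloud developer community and does not represent the views or position of Huawei Cloud or the Huawei Cloud developer community. Reposts must state the source (Huawei Cloud community), the article link, the author, and other basic information; otherwise the author and the community reserve the right to pursue liability. If you find content in this community that you suspect is plagiarized, please report it by email with supporting evidence; once verified, the community will remove the infringing content immediately. Report email: cloudbbs@huaweicloud.com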
