I Used Python to Download 100 GB of Images Overnight, Just in Case the Site Disappears

Posted by 梦想橡皮擦 on 2021/12/10 14:39:00


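Scraping 100 GB of Coser Images with Python

Goals of this post

Scraping target

Target data source: http://www.cosplay8.com/pic/chinacos/. This is yet another Cos site, and sites of this kind tend to vanish from the internet, so let's scrape it to preserve the data.

Python modules used

requests, re, os

Key learning point

Today's focus is paginated scraping of detail pages. This technique has not been covered in earlier posts, so pay particular attention to it while writing the code.

Analyzing the list and detail pages

The browser's developer tools make it easy to find the tags that hold the target data.

Click any image to open its detail page. The target images are shown one per page.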

<a href="javascript:dPlayNext();" id="infoss">
  <img
    src="/uploads/allimg/210601/112879-210601143204.jpg"
    id="bigimg"
    width="800"
    alt=""
    border="0"
/></a>

The URL patterns for the list pages and detail pages are as follows:

List pages

http://www.cosplay8.com/pic/chinacos/list_22_1.html
http://www.cosplay8.com/pic/chinacos/list_22_2.html
http://www.cosplay8.com/pic/chinacos/list_22_3.html

Detail pages

http://www.cosplay8.com/pic/chinacos/2021/0601/61823.html
http://www.cosplay8.com/pic/chinacos/2021/0601/61823_2.html
http://www.cosplay8.com/pic/chinacos/2021/0601/61823_3.html

Note that the first detail page carries no index 1 suffix, so when the total page count is read from that page, its image must be saved at the same time.
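The two URL patterns above can be sketched as a pair of small helpers. This is a minimal illustration, not code from the original post: the function names list_page_urls and detail_page_urls are hypothetical, and the page count is passed in by hand rather than read from the site.

```python
def list_page_urls(category, start, end):
    """Build list-page URLs for pages start..end (inclusive)."""
    base = "http://www.cosplay8.com/pic/chinacos"
    return [f"{base}/list_{category}_{i}.html" for i in range(start, end + 1)]


def detail_page_urls(first_url, page_count):
    """Build every page URL of one gallery; page 1 has no numeric suffix."""
    stem = first_url[: first_url.rindex(".")]  # drop the trailing ".html"
    return [first_url] + [f"{stem}_{i}.html" for i in range(2, page_count + 1)]


print(list_page_urls(22, 1, 3))
print(detail_page_urls(
    "http://www.cosplay8.com/pic/chinacos/2021/0601/61823.html", 3))
```

Keeping the unsuffixed first page as a special case mirrors the note above: pages 2 and onward follow the _N pattern, while page 1 is the bare address.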

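Coding time

The target site groups its images into categories (domestic cos, foreign cos, Hanfu, and Lolita), so the category can be supplied as input at run time, letting the user choose what to scrape.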


def run(category, start, end):
    # Build the list pages to scrape
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html" for i in range(int(start), int(end)+1)]
    print(wait_url)

    url_list = []
    for item in wait_url:
        # get_list is provided later in this post
        ret = get_list(item)

        print(f"Fetched {len(ret)} records")
        url_list.extend(ret)


if __name__ == "__main__":

    # http://www.cosplay8.com/pic/chinacos/list_22_2.html
    category = input("Enter the category ID: ")
    start = input("Enter the start page: ")
    end = input("Enter the end page: ")
    run(category, start, end)

The code above first builds the target URLs from the user's input, then passes each one in turn to the get_list function, shown below:

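import os
import re

import requests

# Assumption: headers is not shown anywhere in this post; a minimal
# browser-like User-Agent is used here as a placeholder.
headers = {"User-Agent": "Mozilla/5.0"}


def get_list(url):
    """
    Collect all detail-page links from one list page.
    """
    res = requests.get(url, headers=headers)
    html = res.text
    pattern = re.compile('<li><a href="(.*?)">')
    all_list = pattern.findall(html)

    return all_list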

The regular expression <li><a href="(.*?)"> matches every detail-page address on the list page, and the matches are returned together as a list.
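The pattern can be tried offline before pointing it at the site. The snippet below is only a sketch: sample_html is a made-up fragment that imitates the list-page markup, not markup copied from the real page.

```python
import re

# Made-up fragment that only imitates the list-page structure.
sample_html = (
    '<ul>'
    '<li><a href="/pic/chinacos/2021/0601/61823.html"><img src="a.jpg"></a></li>'
    '<li><a href="/pic/chinacos/2021/0601/61824.html"><img src="b.jpg"></a></li>'
    '</ul>'
)

pattern = re.compile('<li><a href="(.*?)">')
# With a single capture group, findall returns just the captured strings.
links = pattern.findall(sample_html)
print(links)
```

Because the group is non-greedy, each match stops at the first closing quote, so every link is captured separately instead of one match swallowing the whole line.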

Next, extend the run function so that it fetches the images from each detail page and saves them.

def run(category, start, end):
    # List pages to scrape
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html" for i in range(int(start), int(end)+1)]
    print(wait_url)

    url_list = []
    for item in wait_url:
        ret = get_list(item)

        print(f"Fetched {len(ret)} records")
        url_list.extend(ret)

    print(url_list)
    # print(len(url_list))
    for url in url_list:
        get_detail(f"http://www.cosplay8.com{url}")

The matched detail-page addresses are relative, so they are formatted into full addresses before being requested.
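As an alternative to string formatting, the standard library's urllib.parse.urljoin can resolve a relative address against the site root. This is a sketch of a swapped-in technique, not what the original code does, and the constant name BASE is chosen here for illustration.

```python
from urllib.parse import urljoin

BASE = "http://www.cosplay8.com"  # site root used in the post's URLs

relative = "/pic/chinacos/2021/0601/61823.html"
full = urljoin(BASE, relative)
print(full)

# An address that is already absolute is returned unchanged.
already_full = urljoin(BASE, "http://www.cosplay8.com/pic/chinacos/list_22_1.html")
print(already_full)
```

The benefit over an f-string is that a link which is already absolute does not end up with the host prepended twice.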
The get_detail function is as follows:

def get_detail(url):
    # Request the detail page
    res = requests.get(url=url, headers=headers)
    # Set the encoding
    res.encoding = "utf-8"
    # Get the page source
    html = res.text

    # Extract the total page count; the first image is saved from this page
    size_pattern = re.compile(r'<span>共(\d+)页: </span>')
    # Extract the title; titles turned out to vary, so the pattern was revised
    # title_pattern = re.compile('<title>(.*?)-Cosplay中国</title>')
    title_pattern = re.compile('<title>(.*?)-Cosplay(中国|8)</title>')
    # Pattern for the image address
    first_img_pattern = re.compile("<img src='(.*?)' id='bigimg'")
    try:
        # Try to match the page count
        page_size = size_pattern.search(html).group(1)
        # Try to match the title
        title = title_pattern.search(html).group(1)
        # Try to match the image address
        first_img = first_img_pattern.search(html).group(1)

        print(f"{url} has {page_size} pages", title, first_img)
        # Build the output path
        path = f'images/{title}'
        # Create the directory if it does not exist
        if not os.path.exists(path):
            os.makedirs(path)

        # Save the first image
        save_img(path, title, first_img, 1)

        # Build the URLs of the remaining pages
        urls = [f"{url[0:url.rindex('.')]}_{i}.html" for i in range(2, int(page_size)+1)]

        # Number from 2 so that later images never overwrite image 1
        for index, child_url in enumerate(urls, start=2):
            try:
                res = requests.get(url=child_url, headers=headers)

                html = res.text
                first_img = first_img_pattern.search(html).group(1)

                save_img(path, title, first_img, index)
            except Exception as e:
                print("Failed to fetch sub-page", e)

    except Exception as e:
        print(url, e)

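The core logic is explained in the comments. The key point is the title regular expression, which was initially written as:

<title>(.*?)-Cosplay中国</title>

It later turned out that this did not match every page, so it was changed to:

<title>(.*?)-Cosplay(中国|8)</title>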
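The difference between the two patterns can be checked on a pair of titles. The sketch below uses invented sample titles (the real page titles are not reproduced here); only the two suffixes come from the patterns above.

```python
import re

old_pattern = re.compile('<title>(.*?)-Cosplay中国</title>')
new_pattern = re.compile('<title>(.*?)-Cosplay(中国|8)</title>')

# Invented titles; only the "-Cosplay中国" / "-Cosplay8" suffixes matter.
samples = [
    '<title>Sample Gallery A-Cosplay中国</title>',
    '<title>Sample Gallery B-Cosplay8</title>',
]

old_hits = [old_pattern.search(s) is not None for s in samples]
new_titles = [new_pattern.search(s).group(1) for s in samples]

print(old_hits)    # which samples the first pattern matches
print(new_titles)  # titles captured by the revised pattern
```

The alternation (中国|8) adds a second capture group, which is why get_detail keeps reading the title from group(1).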

The missing save_img function is as follows:

def save_img(path, title, first_img, index):
    try:
        # Request the image
        img_res = requests.get(f"http://www.cosplay8.com{first_img}", headers=headers)
        img_data = img_res.content

        # Reuse the extension from the image address (for example .jpg)
        ext = os.path.splitext(first_img)[1] or ".jpg"
        with open(f"{path}/{title}_{index}{ext}", "wb") as f:
            f.write(img_data)
    except Exception as e:
        print(e)
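Since get_detail and save_img use the page title directly as a directory and file name, a title containing characters such as / or : could make the path invalid on some systems. The helper below is a hypothetical addition, not part of the original code, and the set of characters it replaces is an assumption chosen to be conservative.

```python
import re


def safe_name(title, replacement="_"):
    """Replace characters that are commonly unsafe in file or directory names."""
    # Assumed unsafe set: path separators, Windows-reserved punctuation, whitespace controls.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title)
    # Trailing dots and spaces are also awkward on some file systems.
    cleaned = cleaned.strip(" .")
    return cleaned or "untitled"


print(safe_name("Sample/Gallery: part 1"))
```

One way to use it would be to call safe_name(title) once in get_detail before building path, so the directory name and the file names stay consistent.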

Full code download: https://codechina.csdn.net/hihell/python120, No. 6.
