[Python Web Scraping] PyQuery in Practice: Scraping Images from a Site's Daily Ranking

Posted by 爱打瞌睡的CV君 on 2022/07/07 23:38:20

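1. Constructing the URL

Start by looking at the URL of one page of the site's ranking:
https://www.vilipix.com/ranking?date=20220122&mode=daily&p=2
It is made up of four parts:

base_url = https://www.vilipix.com
date: the ranking date
mode: the ranking type
p: the page number

So the URL can be constructed as: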

url = f'{base_url}/ranking?date={today_str}&mode=daily&p={page}'
# base_url: https://www.vilipix.com
# today_str: date of the ranking to fetch, formatted as YYYYMMDD
# page: ranking page number
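
As a hedged alternative to formatting the query string by hand, the same URL could be assembled with the standard library's `urlencode`. This is only a sketch: `build_ranking_url` is a hypothetical helper name, not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.vilipix.com'


def build_ranking_url(date_str, page, mode='daily'):
    """Return the ranking URL for a date string such as '20220122'."""
    # urlencode escapes the values and joins the pairs with '&'
    query = urlencode({'date': date_str, 'mode': mode, 'p': page})
    return f'{BASE_URL}/ranking?{query}'


print(build_ranking_url('20220122', 2))
```

Called with the date and page from the sample URL above, this reproduces that URL.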

  
 

2. Fetching the Page

Define a function to fetch a page:

def scrap_page(url):
    try:
        response = requests.get(url=url, headers=ua_random())
        if response.status_code == 200:
            response.encoding = 'utf-8'  # the page encoding is known in advance, so set it directly
            return response.text
    except requests.RequestException:
        print(f'Failed to scrape {url}!')

  
 

The function returns the page's HTML text, or None if the request raises an error or the status code is not 200.
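
Because `scrap_page` falls through without a `return` when the status code is not 200 or the request raises, callers receive `None` in those cases and should check for it. The sketch below shows one way to do so with a simple retry loop. `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands in for any function that behaves like `scrap_page`.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) until it returns something other than None."""
    for attempt in range(1, attempts + 1):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < attempts:
            time.sleep(delay)  # brief pause before the next try
    return None  # every attempt failed


# Demo with a stand-in fetcher that fails once, then succeeds
calls = []


def fake_fetch(url):
    calls.append(url)
    return None if len(calls) < 2 else '<html>ok</html>'


print(fetch_with_retry(fake_fetch, 'https://example.com', delay=0))
```

The demo uses a fake fetcher so that it runs without network access; in the script, `scrap_page` would be passed in its place.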

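3. Parsing the Page

The fetched page contains a lot of information, so the useful parts need to be filtered out.

This can also be wrapped in a function: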

def parse_index(html):
    doc = pq(html)
    links = doc('#__layout .illust-content li .illust a')
    for link in links.items():
        href = link.attr('href')
        name = href.split('/')[-1]  # detail-page name, built from the image id to avoid duplicate names
        detail_url = urljoin(base_url, href)  # detail-page URL
        page_count = link('.page-count span').text()
        yield detail_url, page_count, name
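
To make the two string operations in `parse_index` concrete, the following sketch applies them to a made-up `href`. The path `/illust/12345678` is an assumed example of what a link might look like, not a value taken from the site.

```python
from urllib.parse import urljoin

base_url = 'https://www.vilipix.com'
href = '/illust/12345678'  # hypothetical relative link

name = href.split('/')[-1]            # last path segment, used as the file name
detail_url = urljoin(base_url, href)  # resolve the relative link against the site root

print(name)
print(detail_url)
```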

  
 

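For how yield works, this blogger gives a very clear explanation that is worth a look if you need it:

"A Detailed Explanation of yield in Python: The Simplest and Clearest Explanation"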
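
As a quick illustration of why `parse_index` uses `yield`: a function containing `yield` returns a generator, which produces its values one at a time as the caller iterates. The names below are made up for the demo.

```python
def numbered(items):
    """Yield (index, item) pairs lazily, starting from 1."""
    for i, item in enumerate(items, start=1):
        yield i, item


gen = numbered(['a', 'b'])
print(next(gen))  # the first pair is computed only now
print(list(gen))  # the remaining pairs
```

In the same way, `parse_index` hands back one `(detail_url, page_count, name)` tuple per loop iteration instead of building a full list first.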

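4. Downloading the Images

This function was written last, but it ended up being called earlier once the code was tidied up, so it is presented here first. It is very simple and quite practical.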

def download(path, name, image):
    save_path = path + name + '.jpg'
    with open(save_path, 'wb') as f:
        f.write(image)

  
 

path: path of the target folder for the download
name: file name
image: the image content (bytes) to save
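
`download` concatenates `path` and `name` directly, so `path` must end with a separator, and every file gets a `.jpg` suffix regardless of its real format. A possible variation, sketched here with hypothetical names, joins the parts with `os.path.join` and keeps the extension found in the image URL.

```python
import os
from urllib.parse import urlparse


def build_save_path(folder, name, image_url, default_ext='.jpg'):
    """Return folder/name.ext, taking ext from the image URL when present."""
    # Only the URL's path is inspected, so a query string does not interfere
    ext = os.path.splitext(urlparse(image_url).path)[1] or default_ext
    return os.path.join(folder, name + ext)


print(build_save_path('images', '123_1', 'https://example.com/a/b.png?x=1'))
```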

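5. Handling the Detail Page

Visiting the links obtained in step 3 gives the detail pages.

Detail pages fall into two cases:

① the page contains only one image
② the page contains multiple images

What does that mean? On the ranking page, a thumbnail with a number in its top-right corner is case ②; one without a number is case ①.

Why distinguish them?

To keep things easy to tell apart, the images from a multi-image detail page are saved together in their own folder; see the complete code for details.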
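
One way to express this split in code is a small helper that picks the target folder from the page count. This is a sketch under assumptions: `target_folder` is hypothetical, and treating an empty count as a single image is a guess about thumbnails that show no number.

```python
import os


def target_folder(root, name, page_count):
    """Single images go straight into root; multi-image posts get a subfolder."""
    # Assumption: a missing count ('') means the post holds one image
    is_single = page_count.strip() in ('', '1')
    return root if is_single else os.path.join(root, name)


print(target_folder('ranking', '123', ''))
print(target_folder('ranking', '123', '5'))
```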

The functions are as follows:

Case ①:

def detail_index_1(html, name, path):
    doc = pq(html)
    link = doc('.illust-pages li a img').attr('src')
    image = requests.get(url=link, headers=ua_random()).content  # the image bytes to save
    download(path, name, image)  # call the download function

  
 

Case ②:

def detail_index_more(html, name, path):
    doc = pq(html)
    links = doc('.illust-pages li a img')
    i = 1
    for link in links.items():
        src = link.attr('src')
        image_name = name + f'_{i}'  # name each image with a running index
        image = requests.get(url=src, headers=ua_random()).content  # the image bytes to save
        download(path, image_name, image)  # call the download function
        i += 1

  
 

html: HTML text of the detail page
path: path of the target folder for the download
name: file name
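
The manual counter `i` in `detail_index_more` can also be written with `enumerate`. The sketch below only builds the file names, using placeholder URLs, so it runs without network access; `numbered_names` is a hypothetical helper.

```python
def numbered_names(name, srcs):
    """Pair each image URL with a name like '<name>_1', '<name>_2', ..."""
    return [(f'{name}_{i}', src) for i, src in enumerate(srcs, start=1)]


srcs = ['https://example.com/a.jpg', 'https://example.com/b.jpg']  # placeholders
for image_name, src in numbered_names('12345678', srcs):
    print(image_name, src)
```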

6. Complete Code

# -*- coding: UTF-8 -*-
"""
# @Time: 2022/1/22 16:43
# @Author: 远方的星
# @CSDN: https://blog.csdn.net/qq_44921056
"""
import requests
from pyquery import PyQuery as pq
from fake_useragent import UserAgent
import os
import datetime
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

# Random request headers
ua = UserAgent(verify_ssl=False, path='D:/Pycharm/fake_useragent.json')
# Site URL
base_url = 'https://www.vilipix.com'
# Get the current date
today = datetime.date.today()
# Get yesterday's date, used to build the URL
today_str = (datetime.date.today() + datetime.timedelta(days=-1)).strftime('%Y%m%d')
# Create the folders for the day's ranking step by step
path_1 = 'D:/vilipix每日榜单'
if not os.path.exists(path_1):
    os.mkdir(path_1)

path_2 = f'D:/vilipix每日榜单/{today}/'
if not os.path.exists(path_2):
    os.mkdir(path_2)


def ua_random():
    headers = {
        'User-Agent': ua.random  # the HTTP header name is User-Agent, not use_agent
    }
    return headers


def scrap_page(url):
    try:
        response = requests.get(url=url, headers=ua_random())
        if response.status_code == 200:
            response.encoding = 'utf-8'
            return response.text
    except requests.RequestException:
        print(f'Failed to scrape {url}!')


def scrap_index(page):
    url = f'{base_url}/ranking?date={today_str}&mode=daily&p={page}'
    '''
    base_url: https://www.vilipix.com
    today_str: date of the ranking to fetch, formatted as YYYYMMDD
    page: ranking page number
    '''
    return scrap_page(url)


# Parse the ranking page
def parse_index(html):
    doc = pq(html)
    links = doc('#__layout .illust-content li .illust a')
    for link in links.items():
        href = link.attr('href')
        name = href.split('/')[-1]  # detail-page name, built from the image id to avoid duplicate names
        detail_url = urljoin(base_url, href)  # detail-page URL
        page_count = link('.page-count span').text()
        # print(page_count)
        yield detail_url, page_count, name


# Download an image
def download(path, name, image):
    save_path = path + name + '.jpg'
    # print(save_path)
    with open(save_path, 'wb') as f:
        f.write(image)


# Called when the detail page contains only one image
def detail_index_1(html, name, path):
    doc = pq(html)
    link = doc('.illust-pages li a img').attr('src')
    image = requests.get(url=link, headers=ua_random()).content
    download(path, name, image)


# Called when the detail page contains more than one image
def detail_index_more(html, name, path):
    doc = pq(html)
    links = doc('.illust-pages li a img')
    i = 1
    for link in links.items():
        src = link.attr('src')
        image_name = name + f'_{i}'
        image = requests.get(url=src, headers=ua_random()).content
        download(path, image_name, image)
        i += 1


def main(page):
    html = scrap_index(page)
    details = parse_index(html)
    for detail in details:
        detail_url = detail[0]  # detail-page URL
        num = detail[1]  # number of images on the detail page
        name = detail[2]  # name assigned to the detail page
        detail_html = scrap_page(detail_url)
        if num == '1':  # case ①
            detail_index_1(detail_html, name, path_2)
        else:  # case ②
            path_3 = f'D:/vilipix每日榜单/{today}/{name}/'
            if not os.path.exists(path_3):
                os.mkdir(path_3)
            detail_index_more(detail_html, name, path_3)
        print('*'*10, f'{name} downloaded!', '*'*10)


if __name__ == '__main__':
    pages = list(range(1, 11))
    # Use multiple threads to speed things up
    with ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(main, pages)
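
One caveat with `executor.map` as used above: an exception raised inside `main` is only re-raised when the corresponding result is read, and the script never reads the results, so failures pass silently. Below is a sketch of one alternative, using `submit` and `as_completed` with a stand-in `task` instead of the real `main`; `task` and `run_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def task(page):
    """Stand-in for main(page); fails on page 3 to show error handling."""
    if page == 3:
        raise ValueError(f'page {page} failed')
    return page


def run_all(pages, max_workers=5):
    """Run task for every page and collect successes and failures."""
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(task, p): p for p in pages}
        for future in as_completed(futures):
            page = futures[future]
            try:
                done.append(future.result())  # re-raises the task's exception
            except Exception as exc:
                failed.append((page, str(exc)))
    return sorted(done), failed


print(run_all(range(1, 6)))
```

Collecting the failed pages this way makes it possible to report or retry them once the pool has finished.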


  
 

7. Results

The daily ranking lags by one day.

Today is January 23,
so the ranking scraped is the one for January 22.
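
The one-day lag is what the `today_str` expression in the script accounts for. As a small sketch (the function name is hypothetical), the date string can be computed from any given day:

```python
import datetime


def ranking_date_str(today):
    """Return the previous day's date formatted as YYYYMMDD."""
    return (today - datetime.timedelta(days=1)).strftime('%Y%m%d')


print(ranking_date_str(datetime.date(2022, 1, 23)))
```

For January 23, 2022 this gives the `date=20220122` value seen in the sample URL.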

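If this helped, remember to give it a like 👍; it is also the greatest encouragement for the author 🙇‍♂️.
If you spot any shortcomings, please point them out in the comments 👇 and I will fix them as soon as I see them.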

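Author: 远方的星
CSDN: https://blog.csdn.net/qq_44921056
This article is for learning and exchange only. Reproduction without the author's permission is prohibited, as is any other use; violations will be pursued.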

Source: luckystar.blog.csdn.net. Author: 爱打瞌睡的CV君. Copyright belongs to the original author; to repost, please contact the author.

Original link: luckystar.blog.csdn.net/article/details/122657835
