- 微信
- 微博
  
  分享文章到微博
- 复制链接
  
  复制链接到剪贴板

都说python是万能的，这次用python看溧阳摄影圈，真不错

梦想橡皮擦发表于 2022/01/27 21:34:02 2022/01/27

【摘要】本篇博客继续学习 BeautifulSoup，目标站点选取“溧阳摄影圈”，这一地方论坛。目标站点分析本次要采集的目标站点分页规则如下：http://www.jsly001.com/thread-htm-fid-45-page-{页码}.html代码采用多线程 threading 模块+requests 模块+BeautifulSoup 模块编写。采取规则依据列表页 → 详情页。溧阳摄影圈...

本篇博客继续学习 BeautifulSoup，目标站点选取“溧阳摄影圈”，这一地方论坛。

目标站点分析

本次要采集的目标站点分页规则如下：

http://www.jsly001.com/thread-htm-fid-45-page-{页码}.html

代码采用多线程 threading 模块+requests 模块+BeautifulSoup 模块编写。

采取规则依据列表页 → 详情页。

溧阳摄影圈图片采集代码

本案例属于实操案例，bs4 相关知识点已经在上一篇博客进行铺垫，顾先展示完整代码，然后基于注释与重点函数进行说明。

import random
import threading
import logging

from bs4 import BeautifulSoup
import requests
import lxml

logging.basicConfig(level=logging.NOTSET) # 设置日志输出级别

# 声明一个 LiYang 类，其继承自 threading.Thread
class LiYangThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self) # 实例化多线程对象
        self._headers = self._get_headers() # 随机获取 ua
        self._timeout = 5 # 设置超时时间

    # 每个线程都去获取全局资源
    def run(self):
        # while True: # 此处为多线程开启位置
        try:
            res = requests.get(url="http://www.jsly001.com/thread-htm-fid-45-page-1.html", headers=self._headers,
                               timeout=self._timeout) # 测试获取第一页数据
        except Exception as e:
            logging.error(e)

        if res is not None:
            html_text = res.text
            self._format_html(html_text) # 调用html解析函数

    def _format_html(self, html):
        # 使用 lxml 进行解析
        soup = BeautifulSoup(html, 'lxml')

        # 获取板块主题分割区域，主要为防止获取置顶的主题
        part_tr = soup.find(attrs={'class': 'bbs_tr4'})

        if part_tr is not None:
            items = part_tr.find_all_next(attrs={"name": "readlink"}) # 获取详情页地址
        else:
            items = soup.find_all(attrs={"name": "readlink"})

        # 解析出标题与数据
        data = [(item.text, f'http://www.jsly001.com/{item["href"]}') for item in items]
        # 进入标题内页
        for name, url in data:
            self._get_imgs(name, url)

    def _get_imgs(self, name, url):
        """解析图片地址"""
        try:
            res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)
		# 图片提取逻辑
        if res is not None:
            soup = BeautifulSoup(res.text, 'lxml')
            origin_div1 = soup.find(attrs={'class': 'tpc_content'})
            origin_div2 = soup.find(attrs={'class': 'imgList'})
            content = origin_div2 if origin_div2 else origin_div1

            if content is not None:
                imgs = content.find_all('img')

                # print([img.get("src") for img in imgs])
                self._save_img(name, imgs) # 保存图片

    def _save_img(self, name, imgs):
        """保存图片"""
        for img in imgs:
            url = img.get("src")
            if url.find('http') < 0:
                continue
            # 寻找父标签中的 id 属性
            id_ = img.find_parent('span').get("id")

            try:
                res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
            except Exception as e:
                logging.error(e)

            if res is not None:
                name = name.replace("/", "_")
                with open(f'./imgs/{name}_{id_}.jpg', "wb+") as f: # 注意在 python 运行时目录提前创建 imgs 文件夹
                    f.write(res.content)

    def _get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua
        }
        return headers


if __name__ == '__main__':
    my_thread = LiYangThread()
    my_thread.run()

本次案例采用中，BeautifulSoup 模块采用 lxml 解析器 对 HTML 数据进行解析，后续多采用此解析器，在使用前注意先导入 lxml 模块。

数据提取部分采用 soup.find() 与 soup.find_all() 两个函数进行，代码中还使用了 find_parent() 函数，用于采集父级标签中的 id 属性。

# 寻找父标签中的 id 属性
id_ = img.find_parent('span').get("id")

代码运行过程出现 DEBUG 信息，控制 logging 日志输出级别即可。

代码仓库地址：https://codechina.csdn.net/hihell/python120，去给个关注或者 Star 吧。

写在后面

本篇博客为 bs4 应用篇，如有必要，请反复扩展学习。

今天是持续写作的第 239 / 365 天。
期待关注，点赞、评论、收藏。

点赞
收藏
关注作者

0/1000

抱歉，系统识别当前为高风险访问，暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称，即可参与社区互动！

*长度不超过10个汉字或20个英文字符，设置后3个月内不可修改。

确认取消

加入云驻计划，成为创作者

华为云周边好礼
免费体验产品
特殊身份标识
线下官方门票
内部专家零距离
与10000+优质创作者共同成长

立即加入

都说python是万能的，这次用python看溧阳摄影圈，真不错

目标站点分析

溧阳摄影圈图片采集代码

写在后面

全部回复

设置昵称

关于作者

目录

热门推荐查看更多

相关文章

加入云驻计划，成为创作者

相关产品