- 微信
- 微博
  
  分享文章到微博
- 复制链接
  
  复制链接到剪贴板

爬虫不会写？找ChatGPT不就完了，实战爬取某手办网~~~

小馒头学Python 发表于 2023/12/18 16:01:37 2023/12/18

【摘要】文章目录🍋引言🍋训练+代码🍋扩展🍋解决保存后的csv文件，使用Excel打开总是乱码🍋总结🍋引言今天来点有意思的，使用chatGPT如何爬取一个手办网八、使用BeautifulSoup4解析HTML实战（二）这篇文章的之前使用bs4进行爬取的文章，今天我们解放脑子，完全使用ChatGPT，接下来开始我的表演🍋训练+代码注意爬虫守则哈接下来我们问它能不能，并且指明想要爬取的内容接...

文章目录

🍋引言

今天来点有意思的，使用chatGPT如何爬取一个手办网

八、使用BeautifulSoup4解析HTML实战（二）
这篇文章的之前使用bs4进行爬取的文章，今天我们解放脑子，完全使用ChatGPT，接下来开始我的表演

🍋训练+代码

注意爬虫守则哈

接下来我们问它能不能，并且指明想要爬取的内容

接下来我们看看它生成的代码吧

import requests
from bs4 import BeautifulSoup

url = 'https://www.hpoi.net/hobby/all?order=add&category=100'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # 在这里使用BeautifulSoup选择器获取所需信息
    # 请查看目标网页的HTML结构，使用合适的选择器来提取数据

    # 举例：获取所有产品名称
    names = soup.select('.product-name')

    # 举例：获取所有产品厂商
    manufacturers = soup.select('.product-manufacturer')

    # 举例：获取所有产品出荷日期
    ship_dates = soup.select('.product-ship-date')

    # 举例：获取所有产品新增日期
    add_dates = soup.select('.product-add-date')

    # 打印获取到的信息
    for name, manufacturer, ship_date, add_date in zip(names, manufacturers, ship_dates, add_dates):
        print(f'产品名称: {name.text.strip()}')
        print(f'产品厂商: {manufacturer.text.strip()}')
        print(f'出荷日期: {ship_date.text.strip()}')
        print(f'新增日期: {add_date.text.strip()}')
        print('---')

else:
    print(f'请求失败，状态码：{response.status_code}')

当然ChatGPT不会啥都会，比较网页的HTML它不知道，所以上述代码中的一些内容是例子，'.product-name’之类的需要我们自己去在网页找。当然迫不及待的小伙伴会直接运行，但是会发现啥也没有。我们需要再来问问Ai

import requests
from bs4 import BeautifulSoup

url = 'https://www.hpoi.net/hobby/all?order=add&category=100'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # 在这里使用BeautifulSoup选择器获取所需信息
    # 请查看目标网页的HTML结构，使用合适的选择器来提取数据

    # 举例：获取所有产品名称
    names = soup.select('.product-name')

    # 举例：获取所有产品厂商
    manufacturers = soup.select('.product-manufacturer')

    # 举例：获取所有产品出荷日期
    ship_dates = soup.select('.product-ship-date')

    # 举例：获取所有产品新增日期
    add_dates = soup.select('.product-add-date')

    # 打印获取到的信息
    for name, manufacturer, ship_date, add_date in zip(names, manufacturers, ship_dates, add_dates):
        print(f'产品名称: {name.text.strip()}')
        print(f'产品厂商: {manufacturer.text.strip()}')
        print(f'出荷日期: {ship_date.text.strip()}')
        print(f'新增日期: {add_date.text.strip()}')
        print('---')

else:
    print(f'请求失败，状态码：{response.status_code}')

不难发现上述代码加了一个请求头，目的就是让网页把你识别是一个正常用户，之后我们输出还是啥也没有，这时候就开始引导一些Ai了

从Ai的话可以理解，我们需要打印一下看看是HTML压根没有获取到还是什么问题。
我们试过之后发现HTML可以正常的获取，接下来我们就回到上面的问题，这个.product-name假的内容例子，我们需要替换成网页中的，该怎么办呢，这里教大家一招，我们将之前那个HTML保存下来，然后使用Ctrl+F找到一段结构，然后发给Ai让它帮你去做不就完了嘛，废话不多说，开干，我们玩的就是真实

import requests
from bs4 import BeautifulSoup

url = 'https://www.hpoi.net/hobby/all?order=add&category=100'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # 将BeautifulSoup对象保存到文件
    with open('soup.html', 'w', encoding='utf-8') as file:
        file.write(soup.prettify())

    print('BeautifulSoup对象已保存到 soup.html 文件。')

else:
    print(f'请求失败，状态码：{response.status_code}')

保存后，我们来看看HTML

找到一个div，也就是其中一个要爬取的部分

<div class="hpoi-detail-grid-right">
          <div class="hpoi-detail-grid-title hpoi-text-ellipsis">
           <a href="hobby/92265" target="_blank" title="一番赏 五等分的新娘∽ ～五胞胎的庆生～ E奖 中野五月">
            一番赏 五等分的新娘∽ ～五胞胎的庆生～ E奖 中野五月
           </a>
          </div>
          <div class="hpoi-detail-grid-info">
           <span>
            <em>
             厂商：
            </em>
            BANDAI SPIRITS
           </span>
           <span>
            <em>
             出荷：
            </em>
            2024年4月
           </span>
           <span>
            <em>
             新增：
            </em>
            12月12日
           </span>
          </div>
         </div>

然后问Ai

生成如下

import requests
from bs4 import BeautifulSoup

url = 'https://www.hpoi.net/hobby/all?order=add&category=100'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # 获取所有产品信息
    product_elements = soup.select('.hpoi-detail-grid-right')

    # 打印获取到的信息
    for product in product_elements:
        name = product.select_one('.hpoi-detail-grid-title a').text.strip()
        manufacturer = product.select_one('em:contains("厂商：") + *').text.strip()
        ship_date = product.select_one('em:contains("出荷：") + *').text.strip()
        add_date = product.select_one('em:contains("新增：") + *').text.strip()

        print(f'产品名称: {name}')
        print(f'厂商: {manufacturer}')
        print(f'出荷日期: {ship_date}')
        print(f'新增日期: {add_date}')
        print('---')

else:
    print(f'请求失败，状态码：{response.status_code}')

但是会报错，我们接着问Ai就好，因为有的为空

大概需要重复三次，因为厂商、出荷、新增都有为空
最终得到的代码如下

import requests
from bs4 import BeautifulSoup

url = 'https://www.hpoi.net/hobby/all?order=add&category=100'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # 获取所有产品信息
    product_elements = soup.select('.hpoi-detail-grid-right')

    # 打印获取到的信息
    for product in product_elements:
        name = product.select_one('.hpoi-detail-grid-title a').text.strip()
        manufacturer = product.select_one('em:contains("厂商：")')
        ship_date_element = product.select_one('em:contains("出荷：")')

        # 获取 "出荷：" 标签后的所有内容，然后提取文本
        if ship_date_element:
            ship_date_text = ship_date_element.find_next('span').get_text(strip=True)
        else:
            ship_date_text = 'N/A'

        add_date_element = product.select_one('em:contains("新增：")')

        # 获取 "新增：" 标签后的所有内容，然后提取文本
        if add_date_element:
            add_date_text = add_date_element.find_next('span').get_text(strip=True)
        else:
            add_date_text = 'N/A'

        # 检查厂商信息是否存在
        if manufacturer:
            manufacturer_text = manufacturer.find_next('span').text.strip()
        else:
            manufacturer_text = 'N/A'

        print(f'产品名称: {name}')
        print(f'厂商: {manufacturer_text}')
        print(f'出荷日期: {ship_date_text}')
        print(f'新增日期: {add_date_text}')
        print('---')

else:
    print(f'请求失败，状态码：{response.status_code}')

这回就可以打印出三十条数据了

🍋扩展

一般情况下，我们都不光打印一页，一般会好多页，大部分页与页直接会有一些相似处。

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# 循环爬取前三页的数据
for page in range(1, 4):
    url = f'https://www.hpoi.net/hobby/all?order=add&r18=-1&workers=&view=3&category=100&page={page}'

    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

        # 获取所有产品信息
        product_elements = soup.select('.hpoi-detail-grid-right')

        # 打印获取到的信息
        for product in product_elements:
            name = product.select_one('.hpoi-detail-grid-title a').text.strip()
            manufacturer = product.select_one('em:contains("厂商：")')
            ship_date_element = product.select_one('em:contains("出荷：")')
            
            # 获取 "出荷：" 标签后的所有内容，然后提取文本
            if ship_date_element:
                ship_date_text = ship_date_element.find_next('span').get_text(strip=True)
            else:
                ship_date_text = 'N/A'

            add_date_element = product.select_one('em:contains("新增：")')

            # 获取 "新增：" 标签后的所有内容，然后提取文本
            if add_date_element:
                add_date_text = add_date_element.find_next('span').get_text(strip=True)
            else:
                add_date_text = 'N/A'

            # 检查厂商信息是否存在
            if manufacturer:
                manufacturer_text = manufacturer.find_next('span').text.strip()
            else:
                manufacturer_text = 'N/A'

            print(f'产品名称: {name}')
            print(f'厂商: {manufacturer_text}')
            print(f'出荷日期: {ship_date_text}')
            print(f'新增日期: {add_date_text}')
            print('---')

        print(f'第 {page} 页数据已爬取完毕\n')

    else:
        print(f'请求失败，状态码：{response.status_code}')

之后将打印的内容保存到csv文件就可以了

import requests
from bs4 import BeautifulSoup
import csv

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Open a CSV file for writing
with open('product_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    # Create a CSV writer object
    csv_writer = csv.writer(csvfile)

    # Write header row to CSV file
    csv_writer.writerow(['产品名称', '厂商', '出荷日期', '新增日期'])

    # Loop through each page
    for page in range(1, 4):
        url = f'https://www.hpoi.net/hobby/all?order=add&r18=-1&workers=&view=3&category=100&page={page}'

        response = requests.get(url, headers=headers)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            # Get all product information
            product_elements = soup.select('.hpoi-detail-grid-right')

            # Loop through each product on the page
            for product in product_elements:
                name = product.select_one('.hpoi-detail-grid-title a').text.strip()
                manufacturer = product.select_one('em:contains("厂商：")')
                ship_date_element = product.select_one('em:contains("出荷：")')

                # Get "出荷：" element's text content
                if ship_date_element:
                    ship_date_text = ship_date_element.find_next('span').get_text(strip=True)
                else:
                    ship_date_text = 'N/A'

                add_date_element = product.select_one('em:contains("新增：")')

                # Get "新增：" element's text content
                if add_date_element:
                    add_date_text = add_date_element.find_next('span').get_text(strip=True)
                else:
                    add_date_text = 'N/A'

                # Check if manufacturer information exists
                if manufacturer:
                    manufacturer_text = manufacturer.find_next('span').text.strip()
                else:
                    manufacturer_text = 'N/A'

                # Print data to console
                print(f'产品名称: {name}')
                print(f'厂商: {manufacturer_text}')
                print(f'出荷日期: {ship_date_text}')
                print(f'新增日期: {add_date_text}')
                print('---')

                # Write data to CSV file
                csv_writer.writerow([name, manufacturer_text, ship_date_text, add_date_text])

            print(f'第 {page} 页数据已爬取完毕\n')

        else:
            print(f'请求失败，状态码：{response.status_code}')

print('数据已保存到 product_data.csv 文件。')

🍋解决保存后的csv文件，使用Excel打开总是乱码

将文件用记事本打开，使用ANSI编码即可，这样打开的文件就不会是乱码了

🍋总结

合理的利用Ai可以极大的提高我们的生产效率，但你也得会点，在自己有点基础的前提去使用会事半功倍。

点赞
收藏
关注作者

0/1000

抱歉，系统识别当前为高风险访问，暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称，即可参与社区互动！

*长度不超过10个汉字或20个英文字符，设置后3个月内不可修改。

确认取消

加入云驻计划，成为创作者

华为云周边好礼
免费体验产品
特殊身份标识
线下官方门票
内部专家零距离
与10000+优质创作者共同成长

立即加入