Scrapy: Crawling Xici Proxies into MySQL & MongoDB (Hands-On, Step-by-Step Tutorial) 丨【生长吧!Python】

Posted by ruochen on 2021/07/06 16:38:38
[Abstract] Scrapy: crawling Xici proxies into MySQL & MongoDB databases (hands-on, step-by-step tutorial)

Crawling Xici Proxies with Scrapy

1. Create the project

  • scrapy startproject XcSpider

2. Create the spider

Mark the project folder as Sources Root first (in PyCharm), so that imports of your own modules do not fail.
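
The spider itself can be generated with Scrapy's built-in genspider command, run inside the project directory. Assuming the spider name xcdl and the target domain used later in this post:

  • scrapy genspider xcdl xicidaili.com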

3. Create a launcher file main.py

from scrapy import cmdline
cmdline.execute('scrapy crawl xcdl'.split())

4. Overall project tree

On Windows the tree can be printed with tree /F (the /F flag also lists the files in each directory).

│   main.py
│   scrapy.cfg
│   xcdl.log
│
└───XcSpider
    │   items.py
    │   middlewares.py
    │   pipelines.py
    │   settings.py
    │   __init__.py
    │
    ├───mysqlpipelines
    │   │   pipelines.py
    │   │   sql.py
    │   │   __init__.py
    │   │
    │   └───__pycache__
    │           pipelines.cpython-36.pyc
    │           sql.cpython-36.pyc
    │           __init__.cpython-36.pyc
    │
    ├───spiders
    │   │   xcdl.py
    │   │   __init__.py
    │   │
    │   └───__pycache__
    │           xcdl.cpython-36.pyc
    │           __init__.cpython-36.pyc
    │
    └───__pycache__
            items.cpython-36.pyc
            pipelines.cpython-36.pyc
            settings.cpython-36.pyc
            __init__.cpython-36.pyc

5. settings.py configuration

  • Add the MySQL and MongoDB connection settings
  • Configure ITEM_PIPELINES (the pipeline entries are added last); this is explained in more detail later
  • Configure DEFAULT_REQUEST_HEADERS: adding request headers (including a User-Agent) helps avoid being blocked by basic anti-crawling measures
# -*- coding: utf-8 -*-

# Scrapy settings for XcSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'XcSpider'

SPIDER_MODULES = ['XcSpider.spiders']
NEWSPIDER_MODULE = 'XcSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'XcSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                ' Chrome/80.0.3987.149 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'XcSpider.middlewares.XcspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'XcSpider.middlewares.XcspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'XcSpider.pipelines.XcspiderPipeline': 300,
    'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
    'XcSpider.pipelines.XcPipeline': 200,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Enable logging
LOG_FILE = 'xcdl.log'
LOG_LEVEL = 'ERROR'
LOG_ENABLED = True

# MySQL settings
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_PORT = 3306
MYSQL_DB = 'db_xici'

# MongoDB settings
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# MongoDB port
MONGODB_PORT = 27017
# Database name
MONGODB_DBNAME = 'XCDL'
# Collection used to store the scraped data
MONGODB_SHEETNAME = 'xicidaili'
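
As a side note, instead of importing the settings module directly (as the pipelines later in this post do), a pipeline can also read these values through Scrapy's crawler settings. This is only a minimal sketch, not part of the original project; the class name SettingsAwareMongoPipeline is made up for illustration:

import pymongo

class SettingsAwareMongoPipeline(object):
    def __init__(self, host, port, dbname, sheetname):
        # Connect once when the pipeline is created
        client = pymongo.MongoClient(host=host, port=port)
        self.post = client[dbname][sheetname]

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings.get()/getint() fall back to the given default when a key is missing
        return cls(
            host=crawler.settings.get('MONGODB_HOST', '127.0.0.1'),
            port=crawler.settings.getint('MONGODB_PORT', 27017),
            dbname=crawler.settings.get('MONGODB_DBNAME', 'XCDL'),
            sheetname=crawler.settings.get('MONGODB_SHEETNAME', 'xicidaili'),
        )

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item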

6. items.py

  • Define the fields you want to scrape
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class XcspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class XiciDailiItem(scrapy.Item):
    country = scrapy.Field()
    ipaddress = scrapy.Field()
    port = scrapy.Field()
    serveraddr = scrapy.Field()
    isanonymous = scrapy.Field()
    type = scrapy.Field()
    alivetime = scrapy.Field()
    verificationtime = scrapy.Field()
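
For reference, a Scrapy item behaves like a dictionary, which is how the spider below fills it. A quick illustration (the values here are made up):

from XcSpider.items import XiciDailiItem

item = XiciDailiItem()
item['ipaddress'] = '1.2.3.4'   # hypothetical value, for illustration only
item['port'] = '8080'
print(dict(item))               # {'ipaddress': '1.2.3.4', 'port': '8080'}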

7. xcdl.py

  • Parse the page and extract the required fields
# -*- coding: utf-8 -*-
import scrapy
from XcSpider.items import XiciDailiItem

class XcdlSpider(scrapy.Spider):
    name = 'xcdl'
    allowed_domains = ['xicidaili.com']
    start_urls = ['https://www.xicidaili.com/']

    def parse(self, response):
        # print(response.body.decode('utf-8'))
        items_1 = response.xpath('//tr[@class="odd"]')
        items_2 = response.xpath('//tr[@class=""]')
        items = items_1 + items_2

        for item in items:
            # Create a fresh item for each row so that previously yielded items are not overwritten
            infos = XiciDailiItem()
            # Country flag image URL
            countries = item.xpath('./td[@class="country"]/img/@src').extract()
            try:
                country = countries[0]
            except IndexError:
                country = 'None'
            # IP address
            ipaddress = item.xpath('./td[2]/text()').extract()
            try:
                ipaddress = ipaddress[0]
            except IndexError:
                ipaddress = 'None'
            # Port
            port = item.xpath('./td[3]/text()').extract()
            try:
                port = port[0]
            except IndexError:
                port = 'None'
            # Server location
            serveraddr = item.xpath('./td[4]/text()').extract()
            try:
                serveraddr = serveraddr[0]
            except IndexError:
                serveraddr = 'None'
            # Anonymity level
            isanonymous = item.xpath('./td[5]/text()').extract()
            try:
                isanonymous = isanonymous[0]
            except IndexError:
                isanonymous = 'None'
            # Proxy type
            type = item.xpath('./td[6]/text()').extract()
            try:
                type = type[0]
            except IndexError:
                type = 'None'
            # Alive time
            alivetime = item.xpath('./td[7]/text()').extract()
            try:
                alivetime = alivetime[0]
            except IndexError:
                alivetime = 'None'
            # Verification time
            verificationtime = item.xpath('./td[8]/text()').extract()
            try:
                verificationtime = verificationtime[0]
            except IndexError:
                verificationtime = 'None'

            print(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)

            infos['country'] = country
            infos['ipaddress'] = ipaddress
            infos['port'] = port
            infos['serveraddr'] = serveraddr
            infos['isanonymous'] = isanonymous
            infos['type'] = type
            infos['alivetime'] = alivetime
            infos['verificationtime'] = verificationtime


            yield infos

8. pipelines.py

i. Storing into MongoDB

  • Now that the data has been extracted, it can be written to a database. Start with the MongoDB pipeline, which can go directly into the project's existing pipelines.py file.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from XcSpider import settings

class XcspiderPipeline(object):
    def process_item(self, item, spider):
        return item

class XcPipeline(object):
    def __init__(self):
        host = settings.MONGODB_HOST
        port = settings.MONGODB_PORT
        dbname = settings.MONGODB_DBNAME
        sheetname = settings.MONGODB_SHEETNAME
        # Create the MongoDB client connection
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = client[dbname]
        # Collection used to store the data
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one() is the current pymongo API (Collection.insert() was removed in pymongo 4)
        self.post.insert_one(data)
        return item
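
After a crawl, the stored documents can be inspected from a short standalone script. A minimal sketch, assuming the host, port, database and collection names configured in settings.py above:

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
collection = client['XCDL']['xicidaili']
print(collection.count_documents({}))        # number of stored proxy records
print(collection.find_one({}, {'_id': 0}))   # one sample document, without its ObjectId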

ii. Storing into MySQL

  • For MySQL we can define our own custom pipelines
  • Create a new mysqlpipelines folder (or Package) under the project folder; see the project tree above for its exact location
  • First, write a SQL helper module --> sql.py
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File   :project -> sql
@IDE    :PyCharm
@Author :ruochen
@Date   :2020/4/3 12:53
@Desc   
=================================================='''
import pymysql
from XcSpider import settings

MYSQL_HOST = settings.MYSQL_HOST
MYSQL_USER = settings.MYSQL_USER
MYSQL_PASSWORD = settings.MYSQL_PASSWORD
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_DB = settings.MYSQL_DB

db = pymysql.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, host=MYSQL_HOST, port=MYSQL_PORT, database=MYSQL_DB, charset="utf8")
cursor = db.cursor()

class Sql(object):

    @classmethod
    def insert_db_xici(cls, country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime):
        sql = 'insert into xicidaili(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)' \
              ' values (%(country)s, %(ipaddress)s, %(port)s, %(serveraddr)s, %(isanonymous)s, %(type)s, %(alivetime)s, %(verificationtime)s) '
        value = {
            'country': country,
            'ipaddress': ipaddress,
            'port': port,
            'serveraddr': serveraddr,
            'isanonymous': isanonymous,
            'type': type,
            'alivetime': alivetime,
            'verificationtime': verificationtime,
        }
        try:
            cursor.execute(sql, value)
            db.commit()
        except Exception as e:
            print('Insert failed:', e)
            db.rollback()

    # Deduplication: check whether this IP address is already stored
    @classmethod
    def select_name(cls, ipaddress):
        sql = "select exists(select 1 from xicidaili where ipaddress=%(ipaddress)s)"
        value = {
            'ipaddress': ipaddress
        }
        cursor.execute(sql, value)
        return cursor.fetchall()[0]
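
A quick standalone check of these helpers (assuming MySQL is running and the db_xici database and xicidaili table already exist; the address below is made up):

from XcSpider.mysqlpipelines.sql import Sql

# select_name returns a one-element tuple: (1,) if the address is already stored, (0,) otherwise
print(Sql.select_name('1.2.3.4'))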

  • Next, write the pipeline that uses this SQL helper --> pipelines.py

# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File   :project -> pipelines
@IDE    :PyCharm
@Author :ruochen
@Date   :2020/4/3 12:53
@Desc   :
=================================================='''
from XcSpider.items import XiciDailiItem
from .sql import Sql

class XicidailiPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, XiciDailiItem):
            ipaddress = item['ipaddress']
            ret = Sql.select_name(ipaddress)
            if ret[0] == 1:
                print("ip: {} 已经存在啦----".format(ipaddress))
            else:
                country = item['country']
                ipaddress = item['ipaddress']
                port = item['port']
                serveraddr = item['serveraddr']
                isanonymous = item['isanonymous']
                type = item['type']
                alivetime = item['alivetime']
                verificationtime = item['verificationtime']

                Sql.insert_db_xici(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)

        # Return the item so that any later pipelines still receive it
        return item

9. Pipeline settings in settings.py

  • These entries were already added to settings.py earlier; they are listed again here
  • One entry is the MySQL pipeline, the other is the MongoDB pipeline
  • The priority values can be chosen freely (lower numbers run earlier)
  • Both pipelines can be enabled at the same time, or each can be enabled on its own

A small tip: you can first locate the pipeline class with an import statement (letting the IDE resolve the module path), then copy that dotted path into ITEM_PIPELINES, as shown below:

# from XcSpider.mysqlpipelines.pipelines import XicidailiPipeline
ITEM_PIPELINES = {
   # 'XcSpider.pipelines.XcspiderPipeline': 300,
    'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
    'XcSpider.pipelines.XcPipeline': 200,
}

10. Run the program

  • Now we can run main.py to start the crawler
  • The scraped data will then appear in the databases

Storing into MongoDB works straight away (the database and collection are created automatically on first insert). Storing into MySQL requires the database and table to exist first; the table is named xicidaili. The database can be created to match the MYSQL_DB setting, for example with CREATE DATABASE db_xici DEFAULT CHARSET utf8; the CREATE TABLE statement is given below.

Create Table: CREATE TABLE `xicidaili` (
  `id` int(255) unsigned NOT NULL AUTO_INCREMENT,
  `country` varchar(1000) NOT NULL,
  `ipaddress` varchar(1000) NOT NULL,
  `port` int(255) NOT NULL,
  `serveraddr` varchar(50) NOT NULL,
  `isanonymous` varchar(30) NOT NULL,
  `type` varchar(30) NOT NULL,
  `alivetime` varchar(30) NOT NULL,
  `verificationtime` varchar(30) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=64 DEFAULT CHARSET=utf8;
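
After the crawl finishes, the MySQL side can be checked in the same way. A minimal sketch reusing the connection values from settings.py:

import pymysql

db = pymysql.connect(user='root', password='root', host='127.0.0.1', port=3306,
                     database='db_xici', charset='utf8')
with db.cursor() as cursor:
    cursor.execute('select count(*) from xicidaili')
    print(cursor.fetchone()[0])   # number of stored proxy records
db.close()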

end. Results

MySQL database

[Screenshot: proxy records stored in the xicidaili table]

MongoDB database

[Screenshot: documents stored in the xicidaili collection]
