(8) How to Build a Distributed Crawler with Scrapy: Image Downloading (Source Code Included)

~大鱼~, posted on 2021/05/28 03:08:19

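Please credit the source when reposting: http://www.cnblogs.com/codefish/p/4968260.html

Downloading files and images is one of the most common requirements in crawling. In other languages or frameworks, we would typically filter the scraped data first and then call a file-download class asynchronously. Scrapy already has file and image downloading built in, which is very convenient: a few lines of code are enough. Below I demonstrate how to use Scrapy to download the images on the first page of a Douban photo album.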

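Advantages

Automatic deduplication
Asynchronous operation, so downloads do not block
Thumbnails can be generated at specified sizes
Expiration time is calculated
Format conversion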
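
Several of these features are controlled through settings rather than code. The fragment below is a minimal sketch of what such a section of `settings.py` might look like. `IMAGES_THUMBS`, `IMAGES_MIN_WIDTH`, `IMAGES_MIN_HEIGHT`, and `IMAGES_EXPIRES` are settings of Scrapy's image pipeline; the thumbnail names and every numeric value are illustrative assumptions, not values taken from this project.

```python
# Hypothetical additions to settings.py; every value here is only an example.

# Generate two thumbnails per downloaded image, keyed by an arbitrary name.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Skip images smaller than this size (in pixels).
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110

# Do not re-download images that were fetched within the last 90 days.
IMAGES_EXPIRES = 90
```

Thumbnails and size filtering are optional; leaving these settings out keeps the pipeline's default behaviour.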

Implementation

Define the Item

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class DoubanImgsItem(scrapy.Item):
    # URLs of the images to download; read by the images pipeline
    image_urls = Field()
    # Download results; filled in by Scrapy's default ImagesPipeline
    images = Field()
    # Local storage paths; set in the pipeline's item_completed()
    image_paths = Field()

Define the Spider

# coding=utf-8
import sys

from scrapy.spiders import Spider

from douban_imgs.items import DoubanImgsItem

# Python 2 only: force UTF-8 as the default encoding,
# otherwise encoding errors may be raised.
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf8')


class download_douban(Spider):
    name = 'download_douban'

    def __init__(self, url='152686895', *args, **kwargs):
        # `url` is the album ID that is inserted into the start URL.
        self.allowed_domains = ['douban.com']
        self.start_urls = [
            'http://www.douban.com/photos/album/%s/' % (url)]
        self.url = url
        # Call the base-class constructor.
        super(download_douban, self).__init__(*args, **kwargs)

    def parse(self, response):
        """
        :type response: response information
        """
        # Collect the src attribute of every image in the album's photo list.
        list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
        if list_imgs:
            item = DoubanImgsItem()
            item['image_urls'] = list_imgs
            yield item
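
To see what the XPath in `parse` selects, here is a small stand-alone sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of Scrapy's selector, so the expression is adapted slightly (`.get("src")` replaces `/@src`). The sample markup and the `example.com` URLs are invented for illustration and do not reflect Douban's real page.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; the real album page will differ.
SAMPLE = """
<html><body>
  <div class="photolst clearfix">
    <div class="photo_wrap"><a href="#"><img src="https://example.com/a.jpg"/></a></div>
    <div class="photo_wrap"><a href="#"><img src="https://example.com/b.jpg"/></a></div>
  </div>
  <div class="sidebar"><img src="https://example.com/ad.jpg"/></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Select every <img> nested anywhere inside the photo-list container.
imgs = root.findall(".//div[@class='photolst clearfix']//img")

# Equivalent of the trailing /@src in the spider's XPath.
image_urls = [img.get("src") for img in imgs]
print(image_urls)
```

Only the two images inside the `photolst clearfix` container are picked up; the sidebar image is ignored because it sits outside that container.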

Define the Pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class DoubanImgsPipeline(object):
    def process_item(self, item, spider):
        return item


class DoubanImgDownloadPieline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per image URL.
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, detail) pairs;
        # keep the storage paths of the successful downloads.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
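
The list comprehension in `item_completed` may be easier to follow with concrete data. The sketch below runs the same filtering on a hand-made `results` list. The URLs, paths, and checksum strings are made-up placeholders, and a plain `Exception` stands in for the failure object that Scrapy would pass for an unsuccessful download.

```python
# Sample `results` in the shape item_completed() receives:
# a list of (success, detail) pairs, one per requested image.
results = [
    (True, {"url": "https://example.com/a.jpg", "path": "full/aaa.jpg", "checksum": "c1"}),
    (False, Exception("download failed")),  # placeholder for a failed download
    (True, {"url": "https://example.com/b.jpg", "path": "full/bbb.jpg", "checksum": "c2"}),
]

# Same comprehension as in the pipeline: keep only the successful downloads.
image_paths = [x["path"] for ok, x in results if ok]
print(image_paths)  # ['full/aaa.jpg', 'full/bbb.jpg']
```

Because the `if ok` filter runs before `x["path"]` is evaluated, the failed entry is skipped without ever being indexed.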

Configure settings.py to enable the item pipeline

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_imgs.pipelines.DoubanImgDownloadPieline': 300,
}

# Directory where downloaded images are stored
IMAGES_STORE = 'C:\\doubanimgs'
# Number of days before a downloaded image is considered expired
IMAGES_EXPIRES = 90
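
The effect of `IMAGES_EXPIRES` can be pictured as a simple age check on each stored file. The function below is a simplified sketch of that idea, not Scrapy's actual code; the name `is_expired` and its parameters are hypothetical.

```python
import time

SECONDS_PER_DAY = 60 * 60 * 24


def is_expired(last_modified, expires_days, now=None):
    """Return True when a stored file is older than `expires_days` days.

    `last_modified` and `now` are Unix timestamps in seconds.
    """
    if now is None:
        now = time.time()
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days > expires_days


now = time.time()
# A file stored 10 days ago is still fresh, so it would not be fetched again.
print(is_expired(now - 10 * SECONDS_PER_DAY, 90, now))
# A file stored 120 days ago has expired, so it would be downloaded again.
print(is_expired(now - 120 * SECONDS_PER_DAY, 90, now))
```

With the value of 90 used above, an image is downloaded again only once its stored copy is more than 90 days old.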

Result

GitHub repository: https://github.com/BruceDone/scrapy_demo

Source: brucedone.com. Author: 大鱼的鱼塘. Copyright belongs to the original author; please contact the author before reposting.

Original link: brucedone.com/archives/65
