(4) How to Build a Distributed Crawler with Scrapy: Rule-Based Automatic Crawling and Passing Arguments on the Command Line

~大鱼~, posted on 2021/05/27 17:39:39

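This post covers two topics: implementing rule-based crawling, and passing custom arguments on the command line. In my view, a rule-driven crawler is a crawler in the true sense.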

Let's first look, at the logical level, at how this kind of crawler works:

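We give the crawler a starting URL. After loading that page, it extracts all the URLs on it. We define a rule (restricted by a regular expression) that picks out the link forms we want, the crawler fetches those pages and processes them further (data extraction or some other action), and the cycle repeats until the crawl stops.

One potential problem here is crawling the same page more than once. The Scrapy framework already takes care of this. Generally speaking, the common way to filter duplicates is to keep a table of addresses and check it before each fetch; if the address has already been crawled, it is filtered out. The other option is to use an existing general-purpose solution, a Bloom filter.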
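As a rough illustration of the two de-duplication approaches just mentioned, here is a minimal sketch. It is not Scrapy's own duplicate filter: the names `BloomFilter` and `filter_unseen`, the bit-array size, and the number of hash functions are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: it may report false positives, never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, url):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{url}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


def filter_unseen(urls, seen):
    """Yield each URL the first time it appears and record it in `seen`.

    `seen` can be a plain set (the "address table") or a BloomFilter.
    """
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        yield url


if __name__ == "__main__":
    queue = ["/group/a", "/group/b", "/group/a"]
    print(list(filter_unseen(queue, set())))          # exact address table
    print(list(filter_unseen(queue, BloomFilter())))  # probabilistic filter
```

A plain set is exact but grows with every URL stored. The Bloom filter uses a fixed amount of memory, at the cost of occasionally skipping a URL it has never actually seen.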

This time we use CrawlSpider to crawl the information of all the groups listed under a Douban tag:

Create a new class

Inherit from CrawlSpider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo

class MySpider(CrawlSpider):
    ...  # the class body appears in the full listing later in this post

For more about CrawlSpider, see: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

Constructor arguments

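To pass arguments from the command line, the class constructor has to declare the parameters we want to accept, as the __init__ method in the full listing further down does with target.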

On the command line, use it like this:

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7

This passes the custom argument into the spider.
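The value given to target in that command is the percent-encoded (UTF-8) form of the tag; decoded, it reads 文具 (stationery). The sketch below shows one way to produce such a value with the standard library. The helper `build_command` and its default arguments are hypothetical and exist only for this example.

```python
from urllib.parse import quote, unquote


def build_command(tag, spider="douban.xp", logfile="test.log"):
    """Return the scrapy command line for a tag, percent-encoding the tag."""
    return f"scrapy crawl {spider} --logfile={logfile} -a target={quote(tag)}"


if __name__ == "__main__":
    encoded = quote("文具")
    print(encoded)                 # %E6%96%87%E5%85%B7
    print(unquote(encoded))        # 文具
    print(build_command("文具"))
```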

Pay special attention to the last line of the constructor: super(MySpider, self).__init__()

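If we jump to the definition of CrawlSpider, we see that its constructor calls the private method _compile_rules() to compile the rules attribute. If the constructor of our own spider does not call the parent constructor, the rules are never compiled and the crawl fails with an error.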
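The failure mode can be reproduced without Scrapy. The classes below are a simplified stand-in, not Scrapy's source: `BaseCrawler`, `GoodSpider`, `ForgetfulSpider`, and the `matches` method are all hypothetical names that only mimic the pattern of a base constructor preparing state derived from rules.

```python
import re


class BaseCrawler:
    """Stand-in for a base spider that prepares its rules in the constructor."""

    rules = ()

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        # Pre-compile every pattern once so that later matching is cheap.
        self._compiled = [re.compile(pattern) for pattern in self.rules]

    def matches(self, url):
        return any(regex.search(url) for regex in self._compiled)


class GoodSpider(BaseCrawler):
    def __init__(self, target):
        self.rules = (r"tag=%s$" % re.escape(target),)
        super().__init__()  # runs _compile_rules() with the rules set above


class ForgetfulSpider(BaseCrawler):
    def __init__(self, target):
        self.rules = (r"tag=%s$" % re.escape(target),)
        # super().__init__() is missing, so _compiled is never created.


if __name__ == "__main__":
    print(GoodSpider("books").matches("/group/explore?tag=books"))
    try:
        ForgetfulSpider("books").matches("/group/explore?tag=books")
    except AttributeError as exc:
        print("failed:", exc)
```

This also shows why the rules have to be assigned before the call to the parent constructor: the base class reads them at that moment.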

Writing the rules

self.rules = (
    Rule(
        LinkExtractor(
            allow=('/group/explore[?]start=.*?[&]tag=.*?$',),
            restrict_xpaths=('//span[@class="next"]',),
        ),
        callback='parse_next_page',
        follow=True,
    ),
)

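allow defines the form of the links to extract, matched with regular expressions. restrict_xpaths limits extraction to links found inside the specified elements. callback names the function called with the response for each extracted link, and follow=True tells the spider to keep applying the rules to those responses.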
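To check what the allow pattern accepts, it can be tried against a few URLs with the re module alone. This is a sketch that assumes the pattern is applied with search semantics to the absolute URL; the candidate URLs are made up for illustration.

```python
import re

# The allow pattern from the rule above.
ALLOW = re.compile(r'/group/explore[?]start=.*?[&]tag=.*?$')

# Hypothetical URLs of the kind a tag page might link to.
CANDIDATES = [
    'http://www.douban.com/group/explore?start=20&tag=%E6%96%87%E5%85%B7',
    'http://www.douban.com/group/explore?tag=%E6%96%87%E5%85%B7',
    'http://www.douban.com/group/123456/',
]


def allowed(url):
    """Return True when the URL matches the pagination pattern."""
    return ALLOW.search(url) is not None


if __name__ == "__main__":
    for url in CANDIDATES:
        print(allowed(url), url)
```

Only the first URL matches, because the pattern requires a start parameter directly after the question mark. The start page itself (the second URL) is not matched by the rule, which is why the spider handles it separately in parse_start_url.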

Writing the code

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo


class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''
    allowed_domains = ['douban.com']

    def __init__(self, target=None):
        # Compare strings with != ; "is not" tests identity, not equality.
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % (target)
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                               restrict_xpaths=('//span[@class="next"]')),
                 callback='parse_next_page', follow=True),
        )
        # Call the base constructor last so that it compiles the rules set above.
        super(MySpider, self).__init__()

    def _extract_groups(self, response):
        # Shared by both callbacks: one item per group link on the page.
        list_item = response.xpath('//a[@class="nbg"]')
        # xpath() returns a (possibly empty) list, never None, so test for emptiness.
        if not list_item:
            self.logger.info("can't select anything with the selector")
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_next_page(self, response):
        # Callback for every "next page" link matched by the rule.
        self.logger.info('begin init the page %s', response.url)
        return self._extract_groups(response)

    def parse_start_url(self, response):
        # The start page is not matched by the rule, so parse it here.
        self.logger.info('begin init the start page %s', response.url)
        return self._extract_groups(response)

    def parse_next_page_people(self, response):
        self.logger.info('Hi, this is the next page! %s', response.url)
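To see what parse_next_page and parse_start_url pull out of a page without running a crawl, the same XPath idea can be tried on a small hand-written snippet. This is only a sketch: it swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, and the markup, the group URLs, and the helper `extract_groups` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed markup standing in for a tag page.
SAMPLE = """
<div>
  <a class="nbg" href="http://www.douban.com/group/example1/" title="Example group 1">x</a>
  <a class="other" href="http://www.douban.com/about" title="About">x</a>
  <a class="nbg" href="http://www.douban.com/group/example2/" title="Example group 2">x</a>
</div>
"""


def extract_groups(markup, tag):
    """Yield one dict per <a class="nbg"> link, mirroring the item fields."""
    root = ET.fromstring(markup.strip())
    for link in root.findall(".//a[@class='nbg']"):
        yield {
            "group_url": link.get("href", ""),
            "group_tag": tag,
            "group_name": link.get("title", ""),
        }


if __name__ == "__main__":
    for group in extract_groups(SAMPLE, "文具"):
        print(group)
```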

Running it

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7

Results

Summary

This post mainly addressed two questions:
How to pass arguments from the command line
How to write a CrawlSpider

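The features demonstrated here are fairly limited. A real crawl needs further rules and safeguards, for example to avoid being banned, which the next post will cover briefly.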

Source: brucedone.com. Author: 大鱼的鱼塘. Copyright belongs to the original author; please contact the author before reposting.

Original link: brucedone.com/archives/129
