This post covers rule-based crawling and passing custom arguments on the command line. In my view, a rule-driven spider is a crawler in the true sense.
Let's first look, logically, at how such a crawler works:
We start from a given URL. After loading the page we extract every link on it, then apply a rule (expressed as a regular expression) to keep only links of the form we want. We crawl those pages, process them further (extract data or perform other actions), and repeat until we stop. There is a latent problem here: duplicate crawling. The Scrapy framework already handles this for us. In general, the common approach to deduplication is to keep a table of visited URLs and check it before fetching; if a URL has already been crawled, it is simply filtered out. Another option is a ready-made solution such as a Bloom filter.
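As a rough illustration of the visited-table idea (not from the original post; a minimal sketch using an in-memory set, which is essentially what Scrapy's default duplicate filter does with request fingerprints):

class SeenFilter:
    """Minimal URL de-duplication: remember every URL already scheduled."""
    def __init__(self):
        self.seen = set()

    def should_crawl(self, url):
        if url in self.seen:
            return False          # already crawled (or queued), skip it
        self.seen.add(url)
        return True

f = SeenFilter()
assert f.should_crawl('http://www.douban.com/group/explore?tag=a')       # first time: crawl
assert not f.should_crawl('http://www.douban.com/group/explore?tag=a')   # second time: filtered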
This post shows how to use CrawlSpider to crawl the information of every group under a Douban tag.
Create a new class that inherits from CrawlSpider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo
class MySpider(CrawlSpider):
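The GroupInfo item imported above is assumed to live in douban/items.py; here is a sketch inferred from the fields the spider fills in later:

import scrapy

class GroupInfo(scrapy.Item):
    # fields populated by the spider below
    group_url = scrapy.Field()
    group_tag = scrapy.Field()
    group_name = scrapy.Field()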
For more details on CrawlSpider, see: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
Constructor arguments
To pass arguments in from the command line, we accept them as parameters of the spider's constructor.
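A minimal sketch of such a constructor (the complete version appears in the full spider code further down; it extends the class skeleton shown above):

class MySpider(CrawlSpider):
    name = 'douban.xp'

    def __init__(self, target=None):
        # 'target' arrives from the command line via `-a target=...`
        if target is not None:
            self.start_urls = [
                'http://www.douban.com/group/explore?tag=%s' % target
            ]
        super(MySpider, self).__init__()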
On the command line it is used like this:
scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7
This is how the custom parameter is passed into the spider (here target is the URL-encoded tag; %E6%96%87%E5%85%B7 decodes to 文具).
Note the last line of the constructor in particular: super(MySpider, self).__init__()
If we jump to the definition of CrawlSpider:
Its constructor calls a private method that compiles the rules variable. If our own spider never calls the parent constructor, the spider will fail with an error.
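The relevant part of CrawlSpider looks roughly like this (paraphrased from the Scrapy source; the exact code differs between versions):

class CrawlSpider(Spider):
    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()   # turns the tuple of Rule objects into a usable form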
Writing the rules
self.rules = (
    Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                       restrict_xpaths=('//span[@class="next"]',)),
         callback='parse_next_page', follow=True),
)
allow defines, via a regular expression, the pattern of the links we want to extract; restrict_xpaths restricts extraction to links found inside the given XPath region; callback is the function called with the response of each extracted link.
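If you want to sanity-check the extractor before wiring it into a Rule, you can try it interactively, for example inside scrapy shell (a sketch; `response` is the object the shell provides):

from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                   restrict_xpaths=('//span[@class="next"]',))
# run `scrapy shell <url>` first, then paste this to see which links match
for link in le.extract_links(response):
    print(link.url)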
The full code
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo


class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''
    allowed_domains = ['douban.com']

    def __init__(self, target=None):
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % target
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                               restrict_xpaths=('//span[@class="next"]',)),
                 callback='parse_next_page', follow=True),
        )
        # call the parent constructor so that self.rules gets compiled
        super(MySpider, self).__init__()

    def parse_next_page(self, response):
        self.logger.info('begin to parse the page %s', response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # make sure the group list is not empty
        if not list_item:
            self.logger.info('cannot select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_start_url(self, response):
        self.logger.info('begin to parse the start page %s', response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # make sure the group list is not empty
        if not list_item:
            self.logger.info('cannot select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_next_page_people(self, response):
        self.logger.info('Hi, this is the next page! %s', response.url)
Running it
scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7
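If you also want the scraped items written to a file, Scrapy's standard -o option can be appended (a general Scrapy feature, not part of the original command):

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7 -o groups.json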
Results
Summary
This post mainly addressed two problems:
- how to pass arguments in from the command line
- how to write a CrawlSpider
The functionality shown here is fairly limited; in a real run you will need further rules, for example to avoid getting banned, which the next post will cover briefly.
Source: brucedone.com, author: 大鱼的鱼塘. Copyright belongs to the original author; please contact the author before reposting.
Original link: brucedone.com/archives/129