Python: Request/Response

Posted by Lansonli on 2021/09/28 01:03:13

Request

Partial source code of the Request class:


  
# Partial code
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):
        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority
        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback
        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter
        self._meta = dict(meta) if meta else None

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta

The most commonly used parameters:


  
  • url: the URL to request and then process in the next step.
  • callback: the function that will handle the Response returned by this request.
  • method: usually does not need to be specified; defaults to GET. It can be set to "GET", "POST", "PUT", etc., and the string must be uppercase.
  • headers: the headers sent with the request. Usually not needed. Typical contents look like this (anyone who has written a crawler will recognize them):

    Host: media.readthedocs.org
    User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
    Accept: text/css,*/*;q=0.1
    Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
    Accept-Encoding: gzip, deflate
    Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
    Cookie: _ga=GA1.2.1612165614.1415584110;
    Connection: keep-alive
    If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
    Cache-Control: max-age=0

  • meta: commonly used; a dict for passing data between requests (see the sketch after this list). For example:

    request_with_cookies = Request(
        url="http://www.example.com",
        cookies={'currency': 'USD', 'country': 'UY'},
        meta={'dont_merge_cookies': True}
    )

  • encoding: the default 'utf-8' is fine.
  • dont_filter: indicates that this request should not be filtered by the scheduler. Useful when you want to issue the same request multiple times and ignore the duplicates filter. Defaults to False.
  • errback: the function to call when an error occurs while handling the request.
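
To illustrate the meta parameter, here is a minimal sketch of carrying a value from one callback into the next. The spider name, URLs, and the "category" key are made up for this example.

import scrapy

class MetaDemoSpider(scrapy.Spider):
    # Hypothetical spider, used only to show how meta travels with a request
    name = "meta_demo"
    start_urls = ["http://www.example.com/list"]

    def parse(self, response):
        # Attach data to the request; it is carried along to the next callback
        yield scrapy.Request(
            url="http://www.example.com/detail/1",
            callback=self.parse_detail,
            meta={"category": "books"},
        )

    def parse_detail(self, response):
        # response.meta is the same dict that was attached to the request
        category = response.meta["category"]
        self.logger.info("category passed along: %s", category)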

Response


  
# Partial code
class Response(object_ref):

    def __init__(self, url, status=200, headers=None, body='', flags=None, request=None):
        self.headers = Headers(headers or {})
        self.status = int(status)
        self._set_body(body)
        self._set_url(url)
        self.request = request
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError("Response.meta not available, this response "
                                 "is not tied to any request")

Most of the parameters are similar to those of Request above:


  
  • status: the HTTP response status code
  • _set_body(body): sets the response body
  • _set_url(url): sets the response URL
  • self.request = request: the Request object that produced this response
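
A minimal sketch of how these attributes are typically read inside a spider callback (the logging calls are just for illustration):

def parse(self, response):
    # status, url and headers mirror the fields set in __init__ above
    self.logger.info("got %s from %s", response.status, response.url)

    # response.request is the Request that produced this response,
    # and response.meta is a shortcut for response.request.meta
    self.logger.info("originating request: %s", response.request.url)
    self.logger.info("meta passed along: %s", response.meta)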

Sending POST requests

  • You can send a POST request with yield scrapy.FormRequest(url, formdata, callback).

  • If you want the spider to send a POST request right at startup, override the Spider class's start_requests(self) method so the URLs in start_urls are no longer used.


  
import scrapy

class mySpider(scrapy.Spider):
    # start_urls = ["http://www.example.com/"]

    def start_requests(self):
        url = 'http://www.renren.com/PLogin.do'
        # FormRequest is how Scrapy sends POST requests
        yield scrapy.FormRequest(
            url=url,
            formdata={"email": "mr_mao_hacker@163.com", "password": "axxxxxxxe"},
            callback=self.parse_page
        )

    def parse_page(self, response):
        # do something
        pass

Simulating login

Use the FormRequest.from_response() method to simulate a user login.

Websites usually pre-populate certain form fields (such as session data or the authentication token on a login page) through <input type="hidden"> elements.

When scraping pages with Scrapy, if you want to pre-populate or override form fields such as the username and password, you can do so with the FormRequest.from_response() method.

Here is an example spider that uses this method:


  
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check that login succeeded before going on
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        # continue scraping with authenticated session...

Zhihu spider example for reference:

zhihuSpider.py spider code


  
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy import Request, FormRequest
from zhihu.items import ZhihuItem


class ZhihuSpider(CrawlSpider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = [
        "http://www.zhihu.com"
    ]
    rules = (
        Rule(LinkExtractor(allow=(r'/question/\d+#.*?',)), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow=(r'/question/\d+',)), callback='parse_page', follow=True),
    )
    headers = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip,deflate",
        "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Referer": "http://www.zhihu.com/"
    }

    # Override the spider's request generation to send a custom request;
    # the callback is invoked once the request succeeds
    def start_requests(self):
        return [Request("https://www.zhihu.com/login", meta={'cookiejar': 1}, callback=self.post_login)]

    def post_login(self, response):
        print('Preparing login')
        # Extract the _xsrf field from the returned page; it is required to submit the form successfully
        xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
        print(xsrf)
        # FormRequest.from_response is a Scrapy helper for posting forms
        # After a successful login, the after_login callback is invoked
        return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=self.headers,  # note the headers here
                                          formdata={
                                              '_xsrf': xsrf,
                                              'email': '1095511864@qq.com',
                                              'password': '123456'
                                          },
                                          callback=self.after_login,
                                          dont_filter=True
                                          )]

    def after_login(self, response):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def parse_page(self, response):
        problem = Selector(response)
        item = ZhihuItem()
        item['url'] = response.url
        item['name'] = problem.xpath('//span[@class="name"]/text()').extract()
        print(item['name'])
        item['title'] = problem.xpath('//h2[@class="zm-item-title zm-editable-content"]/text()').extract()
        item['description'] = problem.xpath('//div[@class="zm-editable-content"]/text()').extract()
        item['answer'] = problem.xpath('//div[@class=" zm-editable-content clearfix"]/text()').extract()
        return item

Item class setup


  
from scrapy.item import Item, Field


class ZhihuItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = Field()          # URL of the scraped question
    title = Field()        # title of the question
    description = Field()  # description of the question
    answer = Field()       # answers to the question
    name = Field()         # name of the user

settings.py: set the crawl interval


  
BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'

DOWNLOAD_DELAY = 0.25  # set the download interval to 250 ms
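
If you also want to avoid hitting the site at a perfectly regular interval, Scrapy provides the RANDOMIZE_DOWNLOAD_DELAY setting (enabled by default), which waits a random multiple between 0.5x and 1.5x of DOWNLOAD_DELAY. A minimal sketch of the relevant settings:

DOWNLOAD_DELAY = 0.25            # base delay of 250 ms between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # actual wait is a random 0.5x-1.5x multiple of DOWNLOAD_DELAY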

Source: lansonli.blog.csdn.net. Author: Lansonli. Copyright belongs to the original author; please contact the author for permission to reprint.

Original link: lansonli.blog.csdn.net/article/details/102931652
