Python crawler: using Splash with Scrapy to crawl dynamic web pages
Dependencies:
pip install scrapy-splash
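The scrapy-splash package only talks to a running Splash instance; the settings below assume one is listening on localhost:8050. If you do not have Splash running yet, a common way to start it (this command is an addition to the original post; adjust the port mapping as needed) is the official Docker image:

docker run -p 8050:8050 scrapinghub/splash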
Configure settings.py

# Splash server address
SPLASH_URL = 'http://localhost:8050'

# Spider middleware (enables cache_args support, optional)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Downloader middleware settings
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Set the duplicate-request filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Enable the cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
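Before wiring Splash into Scrapy, you can check that the service is reachable by calling its render.html endpoint directly. This is a small sketch added here for convenience; it is not part of the original article and assumes the requests library is installed and Splash is running at the SPLASH_URL configured above.

# Sanity check: ask Splash to render the JavaScript-driven quotes page
import requests

resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://quotes.toscrape.com/js/', 'wait': 5},
)
print(resp.status_code)      # 200 means Splash rendered the page
print('quote' in resp.text)  # the rendered HTML should now contain the quote markup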
Code example
Simply replace the original Request with SplashRequest:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import cmdline
from scrapy_splash import SplashRequest


class ToscrapeJsSpider(scrapy.Spider):
    name = "toscrape_js"
    allowed_domains = ["toscrape.com"]
    start_urls = (
        'http://quotes.toscrape.com/js/',
    )

    def start_requests(self):
        for url in self.start_urls:
            # images=0 skips image downloads; wait gives the page's JavaScript time to run
            yield SplashRequest(url, args={"timeout": 5, "images": 0, "wait": 5})

    def parse(self, response):
        quotes = response.css(".quote .text::text").extract()
        for quote in quotes:
            print(quote)


if __name__ == '__main__':
    cmdline.execute("scrapy crawl toscrape_js".split())
SplashRequest parameter reference
url: the page to crawl
headers: same as for Request
cookies: same as for Request
args: {dict} parameters passed to Splash; wait matters, the JavaScript needs enough time to execute
cache_args: {list} parameters that Splash should cache
endpoint: the Splash endpoint service to use, default render.html
splash_url: the Splash server address, defaults to SPLASH_URL from the settings file
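To see endpoint and cache_args working together, here is a sketch that is not from the original article: it drives Splash through its execute endpoint with a small Lua script, and cache_args tells Splash to cache the script so it is not re-sent with every request. The spider name and the Lua script are illustrative assumptions.

import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash, args)
    splash:go(args.url)      -- load the page
    splash:wait(args.wait)   -- give JavaScript time to run
    return splash:html()     -- return the rendered HTML
end
"""

class QuotesLuaSpider(scrapy.Spider):
    name = "toscrape_js_lua"  # hypothetical name for this sketch
    start_urls = ('http://quotes.toscrape.com/js/',)

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                endpoint='execute',                  # instead of the default render.html
                args={'lua_source': LUA_SCRIPT, 'wait': 5},
                cache_args=['lua_source'],           # Splash caches the script between requests
            )

    def parse(self, response):
        # response.text is the HTML returned by the Lua script
        for quote in response.css(".quote .text::text").extract():
            print(quote)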
Setting a proxy:
args={
    'proxy': 'http://proxy_ip:proxy_port'
}
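In a spider, the proxy entry simply rides along with the other Splash args. A minimal sketch follows; proxy_ip and proxy_port are the placeholders from above, not a working proxy, and the function belongs inside your spider class.

from scrapy_splash import SplashRequest

def start_requests(self):
    # Goes inside your spider class; substitute a real proxy for the placeholder.
    for url in self.start_urls:
        yield SplashRequest(
            url,
            args={
                'wait': 5,
                'proxy': 'http://proxy_ip:proxy_port',
            },
        )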
Source: pengshiyu.blog.csdn.net. Author: 彭世瑜. Copyright belongs to the original author; please contact the author for permission to repost.
Original link: pengshiyu.blog.csdn.net/article/details/81625830