Crawlab — The Ultimate Live Dashboard for Web Crawlers

Marvin Zhang · Published 2020/11/29 17:07:04
[Abstract] Recently I discovered a very interesting yet powerful project. Although it only launched in March 2019, about six months ago, it already has around 2.4k likes. It...
  1. Includes a great-looking real-time dashboard.
  2. Visualizes the crawled data, which can be downloaded by just clicking a button.
  3. Lets you create a crawler by simply entering an XPath and a URL, the so-called "configurable crawler" (unfortunately, the latest version, v0.3.0, has temporarily disabled this feature; a reference can be found here, in Chinese).

Pyspider vs Crawlab

Previously I shared Pyspider as one of the greatest monitoring scraper tools; if you haven't read that article yet, you can find it via the link below.

Why Pyspider May Be One of the Best Scraping Dashboard For Beginner: Pyspider — A Practical Usage on Competitor Monitoring Metrics (towardsdatascience.com)


Compared with Pyspider:

  1. Crawlab is the better choice if you want to integrate multiple languages or web crawler frameworks.
  2. Crawlab is written in Golang, which is generally more efficient and faster.

Case study — Integrating a Scrapy spider into Crawlab

Part 1 — Install Crawlab

Prerequisite — Install Docker on your laptop. Then save the configuration below as docker-compose.yml:

version: '3.3'
services:
  master:
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_API_ADDRESS: "localhost:8000"
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:
      - "8080:8080" # frontend
      - "8000:8000" # backend
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"

Part 2 — Launch Crawlab and Log In

Navigate to localhost:8080 in your browser and you will land on the login page (at the time of writing, the default credentials were admin / admin).


Part 3 — Upload Scrapy Project

Go to the Spiders page in the Crawlab dashboard and click the Add Spider button, then upload your zipped Scrapy project (a zipping sketch follows below).

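Crawlab expects the spider to be uploaded as a zip archive of the project directory. Here is a minimal sketch for producing one, assuming the Scrapy project lives in ./gadgetsnow (the path and archive name are placeholders; adjust them to your own project):

import shutil

# Pack the Scrapy project folder into gadgetsnow.zip for the
# Add Spider upload dialog.
shutil.make_archive('gadgetsnow', 'zip', './gadgetsnow')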

Part 4 — Obtain the IP address of the MongoDB container

Retrieve the ID of the container running the mongo:latest image with docker ps, then pass it to docker inspect to read the container's IP address on the Docker network:

docker ps
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <input your docker id here>
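In this walkthrough the command returned 172.18.0.2, which is the value used for MONGO_HOST in the next step. Since all three services share the default Compose network, the service name mongo would most likely resolve from inside the master container as well; the raw IP simply mirrors the original setup.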

Part 5 — Input the IP address and modify pipelines.py

import os
from pymongo import MongoClient

MONGO_HOST = '172.18.0.2'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

class GadgetsnowPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    # Crawlab tells the spider which collection to write to via an
    # environment variable; fall back to 'test' for local runs.
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        # Tag each item with the Crawlab task that produced it.
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        # Collection.save() is deprecated (removed in PyMongo 4);
        # insert_one() is the modern equivalent here.
        self.col.insert_one(dict(item))
        return item
  1. Copy the process_item function above and use it to replace the one in your original pipelines.py file (also see the settings.py note below).
  2. The value of MONGO_DB can be any database name you like in MongoDB; in my case I set it to crawlab_test.
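For the pipeline to run at all, it must be registered in the project's settings.py. A minimal sketch, assuming the Scrapy project module is named gadgetsnow (adjust to your own layout):

# settings.py
ITEM_PIPELINES = {
    # 300 is an arbitrary priority; lower numbers run first.
    'gadgetsnow.pipelines.GadgetsnowPipeline': 300,
}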

Part 6 — Add two new fields in items.py

task_id = scrapy.Field()
_id = scrapy.Field()
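For context, these two fields sit alongside whatever fields your item already defines. A sketch, assuming the item class is named GadgetsnowItem (the class name is a placeholder):

import scrapy

class GadgetsnowItem(scrapy.Item):
    # ... your existing fields ...
    task_id = scrapy.Field()  # filled in by the pipeline from CRAWLAB_TASK_ID
    _id = scrapy.Field()      # MongoDB document id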

Part 7 — Run your spider

Run the spider from its page in the Crawlab dashboard; the execute command configured for this spider is shown below. Note that CRAWLAB_TASK_ID and CRAWLAB_COLLECTION are only set when the task is launched from Crawlab, so a plain local run falls back to the test collection with task_id as None.
scrapy crawl gdgnow

Part 8 — Visualize the results

The crawled items appear in the spider's Results view in the dashboard, where they can be downloaded by just clicking a button.
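If you want to double-check the data outside the dashboard, a quick query with pymongo also works. A minimal sketch, connecting through the port published in Part 1 and reusing the database from Part 5 (the test collection is the local-run fallback; tasks launched from Crawlab write to the collection named in CRAWLAB_COLLECTION):

from pymongo import MongoClient

# Port 27017 is published by docker-compose, so localhost works
# from the machine running Docker.
client = MongoClient(host='localhost', port=27017)
db = client['crawlab_test']

# Print a handful of stored items.
for doc in db['test'].find().limit(5):
    print(doc)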

Final Thoughts
