Crawlab — The Ultimate Live Dashboard for Web Crawlers

Marvin Zhang · Published 2020/11/29 17:07:04
[Abstract] Recently I discovered a very interesting yet powerful project. Although it only launched in March 2019, about six months ago, it already has around 2.4k likes. It...
  1. Includes a great-looking real-time dashboard.
  2. Visualizes the crawled data, which can be downloaded by just clicking a button.
  3. Lets you create a crawler by simply entering an XPath and a URL, the so-called "configurable crawler" (unfortunately, the latest version, v0.3.0, has temporarily disabled this feature; a reference can be found here, in Chinese).

Pyspider vs Crawlab

Previously I shared Pyspider as one of the greatest monitoring scraper tools; if you haven't read that article yet, you can find it via the link below.

Why Pyspider May Be One of the Best Scraping Dashboard For Beginner: Pyspider — A Practical Usage on Competitor Monitoring Metrics (towardsdatascience.com)


Compared with Pyspider:

  1. Crawlab is the better choice if you want to integrate multiple languages or web crawler frameworks.
  2. Crawlab is written in Golang, which is generally more efficient and faster.

Case study — Integrating a Scrapy spider into Crawlab

Part 1 — Install Crawlab

Prerequisite — Install Docker on your laptop. Then save the configuration below as docker-compose.yml:

version: '3.3'
services:
  master:
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_API_ADDRESS: "localhost:8000"
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:
      - "8080:8080" # frontend
      - "8000:8000" # backend
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"

Part 2 — Launch Crawlab and Log In

Navigate to localhost:8080 in your browser and you will land on the login page (at the time of writing, the default credentials were admin / admin).


Part 3 — Upload Scrapy Project

Go to the Spiders page in the Crawlab dashboard and click the Add Spider button, then upload your zipped Scrapy project (a zipping sketch follows below).

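Crawlab expects the spider to be uploaded as a zip archive of the project directory. Here is a minimal sketch for producing one, assuming the Scrapy project lives in ./gadgetsnow (the path and archive name are placeholders; adjust them to your own project):

import shutil

# Pack the Scrapy project folder into gadgetsnow.zip for the
# Add Spider upload dialog.
shutil.make_archive('gadgetsnow', 'zip', './gadgetsnow')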

Part 4 — Obtain the IP address of the MongoDB container

Retrieve the ID of the container running the mongo:latest image with docker ps, then pass it to docker inspect to read the container's IP address on the Docker network:

docker ps
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <input your docker id here>
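In this walkthrough the command returned 172.18.0.2, which is the value used for MONGO_HOST in the next step. Since all three services share the default Compose network, the service name mongo would most likely resolve from inside the master container as well; the raw IP simply mirrors the original setup.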

Part 5 — Input the IP address and modify pipelines.py

import os
from pymongo import MongoClient

MONGO_HOST = '172.18.0.2'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

class GadgetsnowPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    # Crawlab tells the spider which collection to write to via an
    # environment variable; fall back to 'test' for local runs.
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        # Tag each item with the Crawlab task that produced it.
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        # Collection.save() is deprecated (removed in PyMongo 4);
        # insert_one() is the modern equivalent here.
        self.col.insert_one(dict(item))
        return item
  1. Copy the process_item function above and use it to replace the one in your original pipelines.py file (also see the settings.py note below).
  2. The value of MONGO_DB can be any database name you like in MongoDB; in my case I set it to crawlab_test.
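For the pipeline to run at all, it must be registered in the project's settings.py. A minimal sketch, assuming the Scrapy project module is named gadgetsnow (adjust to your own layout):

# settings.py
ITEM_PIPELINES = {
    # 300 is an arbitrary priority; lower numbers run first.
    'gadgetsnow.pipelines.GadgetsnowPipeline': 300,
}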

Part 6 — Add two new fields in items.py

task_id = scrapy.Field()
_id = scrapy.Field()
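For context, these two fields sit alongside whatever fields your item already defines. A sketch, assuming the item class is named GadgetsnowItem (the class name is a placeholder):

import scrapy

class GadgetsnowItem(scrapy.Item):
    # ... your existing fields ...
    task_id = scrapy.Field()  # filled in by the pipeline from CRAWLAB_TASK_ID
    _id = scrapy.Field()      # MongoDB document id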

Part 7 — Run your spider

Run the spider from its page in the Crawlab dashboard; the execute command configured for this spider is shown below. Note that CRAWLAB_TASK_ID and CRAWLAB_COLLECTION are only set when the task is launched from Crawlab, so a plain local run falls back to the test collection with task_id as None.
scrapy crawl gdgnow

Part 8 — Visualize the results

The crawled items appear in the spider's Results view in the dashboard, where they can be downloaded by just clicking a button.
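If you want to double-check the data outside the dashboard, a quick query with pymongo also works. A minimal sketch, connecting through the port published in Part 1 and reusing the database from Part 5 (the test collection is the local-run fallback; tasks launched from Crawlab write to the collection named in CRAWLAB_COLLECTION):

from pymongo import MongoClient

# Port 27017 is published by docker-compose, so localhost works
# from the machine running Docker.
client = MongoClient(host='localhost', port=27017)
db = client['crawlab_test']

# Print a handful of stored items.
for doc in db['test'].find().limit(5):
    print(doc)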

Final Thoughts
