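Scraping Mipu Proxy (米扑代理) with WebMagic + Selenium + Tesseract-OCR
About the author: Hi everyone, I'm 芝士味的椒盐, a university student who loves sharing knowledge. Glad to meet you here 🌟
🌈 Areas of focus: Java, big data, operations, electronics
🤝 My knowledge is limited and I aim to write simple, easy-to-follow articles. If anything in this post is wrong, corrections are very welcome. Thank you!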
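Introduction to WebMagic
WebMagic is a crawler framework for convenient data collection that requires no configuration and offers a simple, flexible API. It has a modular architecture covering the whole crawler lifecycle: link extraction -> page download -> content extraction -> data persistence. It also supports multi-threaded crawling, distributed crawling, automatic retries, custom cookies, and customizable modules.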
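The four lifecycle stages can be pictured as a simple loop. The sketch below uses plain JDK types and is not WebMagic's own code; `CrawlLoopSketch`, `Parsed`, and the `"RECORD:link1,link2"` page format are all made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class CrawlLoopSketch {
    /** What extraction yields for one page: a record to keep and links to follow. */
    static final class Parsed {
        final String record;
        final List<String> links;

        Parsed(String record, List<String> links) {
            this.record = record;
            this.links = links;
        }
    }

    /** Toy extractor for bodies shaped like "RECORD:link1,link2" (a made-up format). */
    static Parsed parse(String body) {
        String[] parts = body.split(":", 2);
        List<String> links = new ArrayList<>();
        if (parts.length == 2) {
            for (String link : parts[1].split(",")) {
                if (!link.isEmpty()) {
                    links.add(link);
                }
            }
        }
        return new Parsed(parts[0], links);
    }

    /** Link extraction -> download -> content extraction -> persistence (here: a list). */
    static List<String> crawl(String seed,
                              Function<String, String> download,
                              Function<String, Parsed> extract) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> stored = new ArrayList<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String body = download.apply(queue.poll());
            if (body == null) {
                continue; // download failed or page unknown
            }
            Parsed parsed = extract.apply(body);
            stored.add(parsed.record);
            for (String link : parsed.links) {
                if (seen.add(link)) { // only enqueue URLs not seen before
                    queue.add(link);
                }
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("a", "A:b,c");
        pages.put("b", "B:a");
        pages.put("c", "C:");
        System.out.println(crawl("a", pages::get, CrawlLoopSketch::parse));
    }
}
```

In WebMagic these roles are filled by the Scheduler, Downloader, PageProcessor, and Pipeline components respectively.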
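Introduction to Selenium
Selenium is an open-source framework released under the Apache License 2.0 for automated testing of web applications. Selenium tests run inside a real browser, just as a real user would operate it; supported browsers include Firefox, Safari, Chrome, Opera, and others.
Introduction to Tesseract-OCR
An open-source OCR engine developed at HP Labs and maintained by Google. Compared with MODI (Microsoft Office Document Imaging), its recognition data can be trained continually, so its image-to-text ability keeps improving.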
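1. Project Requirements
As is well known, data collection is one of the most important steps in data mining. With the arrival of the big-data era the demand for data has grown further, which forces crawlers to fetch more data in less time. Many sites, however, now deploy anti-crawling measures, the most common being to block any IP that sends too many requests within a short period. One workaround is a proxy server: we send requests to the proxy, and the proxy requests the target site, so if an IP gets blocked it is the proxy's rather than ours. We usually do not own that many proxy servers, but there are many proxy providers on the market, such as today's target, Mipu Proxy. It sells paid proxies, but its free proxies are usable too. Our task is to scrape each proxy's IP, port, type, anonymity level, country (province/city), ISP, response time, transfer speed, verification date, and so on.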
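To make the proxy idea concrete, here is a minimal sketch of sending a request through an HTTP proxy with the JDK's `java.net` classes. The class name `ProxyRequestSketch`, the proxy address `127.0.0.1:8080`, and the target URL are placeholders, not values from the project.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

final class ProxyRequestSketch {
    /** Describes an HTTP proxy without resolving its host name yet. */
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    /** Prepares a connection that will be routed through the given proxy. */
    static HttpURLConnection open(String targetUrl, Proxy proxy) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(targetUrl).openConnection(proxy);
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder proxy; in practice this would be one of the scraped ip/port pairs.
        Proxy proxy = httpProxy("127.0.0.1", 8080);
        HttpURLConnection conn = open("http://example.com/", proxy);
        // The target site sees the proxy's IP, not ours.
        System.out.println("status via proxy: " + conn.getResponseCode());
    }
}
```

If the target blocks the proxy's IP, only the proxy entry needs to be swapped out; the crawler's own address is never exposed to the target.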
Finished project on Gitee: mipuproxy: scraping Mipu Proxy with webmagic+selenium+tesseract-ocr
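2. Technical Feasibility Analysis
First, a screenshot: [figure: the proxy list page on the target site]
Clearly everything except the port is fairly easy to handle. Because the port is rendered as an image, Tesseract-OCR is the natural choice for recognizing it. Selenium is used to simulate a human visitor so that page access is as unobstructed as possible, and the crawler as a whole is built on WebMagic, which is written in Java.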
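As a rough illustration of the OCR step, the sketch below builds a `tesseract` command line for a single line of digits and then cleans up the recognized text. `OcrPortSketch` and its method names are hypothetical. `--psm 7` and `tessedit_char_whitelist` are real Tesseract options, though whether the whitelist takes effect depends on the installed version and engine mode. The letter-to-digit substitutions (B to 8, s to 5, e to 2) are the same ones the project's code handles.

```java
import java.util.Arrays;
import java.util.List;

final class OcrPortSketch {
    /** Command line asking Tesseract to read one text line made of digits. */
    static List<String> command(String imagePath) {
        return Arrays.asList("tesseract", imagePath, "stdout",
                "--psm", "7",
                "-c", "tessedit_char_whitelist=0123456789");
    }

    /**
     * Turns raw OCR text into a port number.
     * Known letter/digit confusions are mapped, all other characters are dropped.
     * Returns -1 when no valid port (1..65535) can be recovered.
     */
    static int parsePort(String raw) {
        if (raw == null) {
            return -1;
        }
        StringBuilder digits = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (c == 'B') {
                digits.append('8');
            } else if (c == 's') {
                digits.append('5');
            } else if (c == 'e') {
                digits.append('2');
            }
        }
        if (digits.length() == 0 || digits.length() > 5) {
            return -1;
        }
        int port = Integer.parseInt(digits.toString());
        return (port >= 1 && port <= 65535) ? port : -1;
    }

    public static void main(String[] args) {
        System.out.println(command("port.png"));
        System.out.println(parsePort("B080"));
    }
}
```

Unlike the project's code, this sketch never drops a leading `B` from three-character results; it always maps `B` to `8`. Returning -1 rather than a default port keeps bad recognitions easy to filter out later.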
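3. Implementation
The project is organized as a Spring Boot application, as shown below: [figure: project structure]
The layers are clearly separated: dao is the data-access layer, entity holds the database entities, service is the service layer, and task holds the WebMagic jobs.
The Maven coordinates of the packages this project needs are as follows: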
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<java.version>1.8</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
<exclusions>
<exclusion>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-tomcat</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-devtools</artifactId>
<scope>runtime</scope>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-configuration-processor</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
</dependency>
<!-- <dependency>-->
<!-- <groupId>com.google.guava</groupId>-->
<!-- <artifactId>guava</artifactId>-->
<!-- <version>23.0</version>-->
<!-- </dependency>-->
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.141.59</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-selenium</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-chrome-driver</artifactId>
</dependency>
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>4.5.4</version>
<exclusions>
<exclusion>
<groupId>net.java.dev.jna</groupId>
<artifactId>jna</artifactId>
</exclusion>
<exclusion>
<groupId>net.sourceforge.lept4j</groupId>
<artifactId>lept4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>net.java.dev.jna</groupId>
<artifactId>jna</artifactId>
<version>4.4.0</version>
</dependency>
<dependency>
<groupId>net.sourceforge.lept4j</groupId>
<artifactId>lept4j</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid-spring-boot-starter</artifactId>
<version>1.2.6</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<excludes>
<exclude>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
</exclude>
</excludes>
</configuration>
</plugin>
</plugins>
</build>
DAO interface:
package icu.smile.proxy.dao;
import icu.smile.proxy.entity.ProxyMiPu;
import org.springframework.data.jpa.repository.JpaRepository;
/**
* <p>
* DAO layer
* </p>
*
* @author starrysky
* @since 2021/6/7
*/
public interface MiPuDao extends JpaRepository<ProxyMiPu, Long> {
}
Entity class mapped to the database table:
package icu.smile.proxy.entity;
import lombok.Data;
import lombok.experimental.Accessors;
import javax.persistence.*;
/**
* <p>
* Entity wrapper
* </p>
*
* @author starrysky
* @since 2021/6/6
*/
@Entity
@Table(name = "proxy_mipu")
@Data
@Accessors(chain = true)
public class ProxyMiPu {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
private String ip;
private Integer port;
private String type;
private String anonymous;
private String location;
private String operator;
private String responseTime;
private String transmissionTime;
private String verificationTime;
}
Service layer:
package icu.smile.proxy.service;
import icu.smile.proxy.entity.ProxyMiPu;
import java.util.List;
/**
* <p>
* Service layer
* </p>
*
* @author starrysky
* @since 2021/6/7
*/
public interface MiPuService {
void save(ProxyMiPu proxyMiPu);
List<ProxyMiPu> findAll(ProxyMiPu proxyMiPu);
void saveAll(List<ProxyMiPu> entityList);
}
package icu.smile.proxy.service.impl;
import icu.smile.proxy.dao.MiPuDao;
import icu.smile.proxy.entity.ProxyMiPu;
import icu.smile.proxy.service.MiPuService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.Example;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import java.util.List;
import java.util.logging.Logger;
/**
* <p>
* TODO
* </p>
*
* @author starrysky
* @since 2021/6/7
*/
@Service
public class MiPuServiceImpl implements MiPuService {
@Autowired
private MiPuDao miPuDao;
@PersistenceContext
protected EntityManager entityManager;
private static final Logger LOGGER = Logger.getLogger(MiPuServiceImpl.class.getName());
@Override
@Transactional
public void save(ProxyMiPu proxyMiPu) {
ProxyMiPu miPum = new ProxyMiPu();
miPum.setIp(proxyMiPu.getIp());
miPum.setPort(proxyMiPu.getPort());
List<ProxyMiPu> list = this.findAll(miPum);
if (list.size()==0){
this.miPuDao.saveAndFlush(proxyMiPu);
}
}
@Override
@Transactional(propagation = Propagation.REQUIRED)
public void saveAll(List<ProxyMiPu> entityList){
miPuDao.saveAll(entityList);
}
@Override
public List<ProxyMiPu> findAll(ProxyMiPu proxyMiPu) {
// Query by example: only the non-null fields of proxyMiPu are matched
Example<ProxyMiPu> example = Example.of(proxyMiPu);
return this.miPuDao.findAll(example);
}
}
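In the service above, `save()` checks for an existing ip/port pair before writing, while `saveAll()` writes the batch as-is. One possible way to drop repeats inside a batch before handing it to `saveAll()` is sketched below with plain JDK collections; `BatchDedupSketch`, `distinctBy`, and the `"ip:port"` key format are illustrative names, not part of the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class BatchDedupSketch {
    /** Keeps the first item seen for each key and preserves the original order. */
    static <T> List<T> distinctBy(List<T> items, Function<? super T, String> key) {
        Map<String, T> firstSeen = new LinkedHashMap<>();
        for (T item : items) {
            firstSeen.putIfAbsent(key.apply(item), item);
        }
        return new ArrayList<>(firstSeen.values());
    }

    public static void main(String[] args) {
        // Each string stands in for an entity whose key would be ip + ":" + port.
        List<String> batch = Arrays.asList("1.1.1.1:80", "2.2.2.2:81", "1.1.1.1:80");
        System.out.println(distinctBy(batch, s -> s));
    }
}
```

This only removes duplicates within one batch; rows already stored in the database would still need the existence check that `save()` performs, or a unique constraint on the two columns.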
WebMagic task classes:
package icu.smile.proxy.task;
import org.apache.commons.lang3.StringUtils;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;
/**
* <p>
* Crawler page processor
* </p>
*
* @author starrysky
* @since 2021/6/6
*/
@Component
public class MiPuPageProcessor implements PageProcessor {
private static final Logger LOGGER = Logger.getLogger(MiPuPageProcessor.class.getName());
private static String cmdPrefix = "tesseract ";
private static String cmdSuffix = " stdout";
private static Process process = null;
private static BufferedReader bufferedReader = null;
private static String ImageResultOCR = null;
private Site site = Site.me()
.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36")
.setDomain("proxy.mimvp.com")
.addCookie("UMIMVPSESSID", "hct11388aottkqvlgkoecsqna6")
.addCookie("Hm_lvt_51e3cc975b346e7705d8c255164036b3", "1622948301")
.addCookie("Hm_lpvt_51e3cc975b346e7705d8c255164036b3", "1622949348")
.setCharset("UTF-8")
.setTimeOut(5000)
.setRetrySleepTime(1000)
.setRetryTimes(3);
/***
* <p>
* Main processing method
* </p>
* @author starrysky
* @since 2021/6/6 22:14
* @param page the downloaded page
*/
@Override
public void process(Page page) {
final List<String> proxyTabes = proxyTabes(page);
page.addTargetRequests(proxyTabes);
page.putField("ip", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-ip']/text()").all());
page.putField("port", proxyPort(page));
page.putField("type", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-type']/text()").all());
page.putField("anonymous", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-anonymous']/text()").all());
page.putField("location", fixLocation(page));
page.putField("operator", fixOperator(page));
page.putField("responseTime", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-pingtime']/@title").all());
page.putField("transmissionTime", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-transfertime']/@title").all());
page.putField("verificationTime", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-checkdtime']/text()").all());
page.addTargetRequests(fixSmileUrl(page));
page.addTargetRequests(fixListnav(page));
}
/***
* <p>
* Extracts the URLs of the three proxy categories from the page
* </p>
* @author starrysky
* @since 2021/6/6 22:11
* @param page the downloaded page
* @return java.util.List<java.lang.String> URLs of the three proxy category tabs
*/
public List<String> proxyTabes(Page page) {
List<String> tabes = new ArrayList<>();
for (String value : page.getHtml().css("div.free-proxytype-tabs").xpath("//a/@href").all()) {
tabes.add(page.getUrl().toString().substring(0, 23) + value);
}
return tabes;
}
/***
* <p>
* Extracts the images that show the proxy ports and converts them to numbers
* </p>
* @author starrysky
* @since 2021/6/6 22:12
* @param page the downloaded page
* @return java.util.List<java.lang.Integer> the recognized ports
*/
public List<Integer> proxyPort(Page page) {
List<Integer> protoPort = new ArrayList<>();
for (String value : page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-port']//img/@src").all()) {
protoPort.add(convertImageInteger(page.getUrl().toString().substring(0, 23) + value));
}
return protoPort;
}
/***
* <p>
* Runs Tesseract on one port image and parses the recognized text
* </p>
* @author starrysky
* @since 2021/6/6 22:13
* @param url URL of one image showing a port
* @return java.lang.Integer the port number, or 0 if recognition fails
*/
public synchronized Integer convertImageInteger(String url) {
    try {
        process = Runtime.getRuntime().exec(cmdPrefix + url + cmdSuffix);
        bufferedReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line = bufferedReader.readLine();
        if (line == null) {
            throw new IOException("tesseract produced no output for " + url);
        }
        // Strip punctuation introduced by OCR noise
        ImageResultOCR = line.replaceAll("[,.!@#$%^&*]", "");
        // Handle 8080 being recognized as B080
        if (ImageResultOCR.contains("B") && ImageResultOCR.length() != 3) {
            ImageResultOCR = ImageResultOCR.replace("B", "8");
            // A three-character result is treated as a two-digit port with a spurious B, e.g. B80.
            // This is somewhat risky and will be improved later.
        } else if (ImageResultOCR.contains("B") && ImageResultOCR.length() == 3) {
            ImageResultOCR = ImageResultOCR.replace("B", "");
        } else if (ImageResultOCR.contains("s") && ImageResultOCR.contains("e")) {
            ImageResultOCR = ImageResultOCR.replace("s", "5").replace("e", "2");
        } else if (ImageResultOCR.contains("s")) {
            ImageResultOCR = ImageResultOCR.replace("s", "5");
        } else if (ImageResultOCR.contains("e")) {
            ImageResultOCR = ImageResultOCR.replace("e", "2");
        }
        return Integer.parseInt(ImageResultOCR);
    } catch (Exception e) {
        // Covers I/O failures as well as text that cannot be parsed as a number
        ImageResultOCR = "0";
        LOGGER.info("OCR recognition failed; using 0 for the port.");
        return 0;
    } finally {
        if (bufferedReader != null) {
            try {
                bufferedReader.close();
            } catch (IOException ignored) {
                // nothing useful to do if closing fails
            }
        }
    }
}
public List<String> fixLocation(Page page) {
List<String> location = new ArrayList<>();
for (String loc : page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-country']/text()").all()) {
location.add(loc.replace("(", "").replace(")", "").trim());
}
return location;
}
/***
* <p>
* Normalizes the ISP description
* </p>
* @author starrysky
* @since 2021/6/6 23:22
* @param page the downloaded page
* @return java.util.List<java.lang.String> the ISP descriptions
*/
public List<String> fixOperator(Page page) {
List<String> operator = new ArrayList<>();
for (String oper : page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-isp']/text()").all()) {
operator.add(oper == null ? "暂无运营商" : oper);
}
return operator;
}
/***
* <p>
* Builds the URLs of the paginated list pages
* </p>
* @author starrysky
* @since 2021/6/6 23:23
* @param page the downloaded page
* @return java.util.List<java.lang.String> URLs of the list pages
*/
public List<String> fixListnav(Page page) {
    List<String> listnv = new ArrayList<>();
    // All pagination hrefs, evaluated once
    final List<String> hrefs = page.getHtml().css("div#listnav").css("ul").xpath("//li//a/@href").all();
    // If the pager contains "...", some pages are elided, so generate every URL
    // from the page numbers of the first and last href
    if (page.getHtml().css("div#listnav").css("ul").xpath("//li/text()").all().contains("...")) {
        // Split the first href on "="; the last element is the page number
        final String[] firstElement = hrefs.get(0).split("=");
        // Everything before the page number
        final String[] urlBody = Arrays.copyOf(firstElement, firstElement.length - 1);
        // First page number, e.g. 1
        int firstId = Integer.parseInt(firstElement[firstElement.length - 1]);
        final String[] lastElement = hrefs.get(hrefs.size() - 1).split("=");
        // Last page number, e.g. 48
        int lastId = Integer.parseInt(lastElement[lastElement.length - 1]);
        // Generate every URL between firstId and lastId
        for (int i = firstId; i <= lastId; i++) {
            listnv.add("https://proxy.mimvp.com/" + StringUtils.join(urlBody, "=") + "=" + i);
        }
        return listnv;
    }
    // No "...": there are only a few pages, so use the hrefs directly
    for (String url : hrefs) {
        listnv.add("https://proxy.mimvp.com/" + url);
    }
    return listnv;
}
public List<String> fixSmileUrl(Page page){
List<String> smileurl = new ArrayList<>();
for (String url:page.getHtml().xpath("//div[@class='free-httptype-tabs']//a/@href").all()){
smileurl.add("https://proxy.mimvp.com/"+url);
}
return smileurl;
}
@Override
public Site getSite() {
return site;
}
}
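The pagination logic in `fixListnav` boils down to taking the first and last page links and generating everything in between. A standalone sketch of that idea follows; `PageRangeSketch` and the sample hrefs are hypothetical, since the site's real link format may differ. It also avoids producing a double slash when the href already begins with `/`.

```java
import java.util.ArrayList;
import java.util.List;

final class PageRangeSketch {
    /** Joins a base URL and a path with exactly one slash between them. */
    static String join(String base, String path) {
        boolean baseSlash = base.endsWith("/");
        boolean pathSlash = path.startsWith("/");
        if (baseSlash && pathSlash) {
            return base + path.substring(1);
        }
        if (!baseSlash && !pathSlash) {
            return base + "/" + path;
        }
        return base + path;
    }

    /** Expands "...=first" through "...=last" into one URL per page number. */
    static List<String> expand(String base, String firstHref, String lastHref) {
        int cut = firstHref.lastIndexOf('=');
        String prefix = firstHref.substring(0, cut + 1); // keeps the trailing '='
        int first = Integer.parseInt(firstHref.substring(cut + 1));
        int last = Integer.parseInt(lastHref.substring(lastHref.lastIndexOf('=') + 1));
        List<String> urls = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            urls.add(join(base, prefix + i));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample hrefs are assumptions about the link shape, not captured from the site.
        for (String url : expand("https://proxy.mimvp.com/",
                "/freesecret?proxy=in_hp&sort=&page=1",
                "/freesecret?proxy=in_hp&sort=&page=3")) {
            System.out.println(url);
        }
    }
}
```

Using `lastIndexOf('=')` instead of splitting and re-joining keeps any earlier `=` characters in the query string untouched.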
package icu.smile.proxy.task;
import icu.smile.proxy.entity.ProxyMiPu;
import icu.smile.proxy.service.MiPuService;
import lombok.extern.java.Log;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;
/**
* <p>
* Persistence layer
* </p>
*
* @author starrysky
* @since 2021/6/6
*/
@Component
@Log
public class MiPuPipeline implements Pipeline {
@Autowired
private MiPuService miPuService;
private static final Logger LOGGER = Logger.getLogger(MiPuPipeline.class.getName());
@Override
public void process(ResultItems resultItems, Task task) {
    final List<String> ip = resultItems.get("ip");
    final List<Integer> port = resultItems.get("port");
    final List<String> type = resultItems.get("type");
    final List<String> anonymous = resultItems.get("anonymous");
    final List<String> location = resultItems.get("location");
    final List<String> operator = resultItems.get("operator");
    final List<String> responseTime = resultItems.get("responseTime");
    final List<String> transmissionTime = resultItems.get("transmissionTime");
    final List<String> verificationTime = resultItems.get("verificationTime");
    if (ip == null || port == null || ip.isEmpty() || port.isEmpty()) {
        return;
    }
    final List<ProxyMiPu> entityList = new ArrayList<>();
    // Only build rows that have both an IP and a port
    final int rows = Math.min(ip.size(), port.size());
    for (int i = 0; i < rows; i++) {
        // Create a new entity per row; reusing a single instance would fill
        // the list with references to the same (last) row
        final ProxyMiPu proxyMiPu = new ProxyMiPu()
                .setIp(ip.get(i))
                .setPort(port.get(i))
                .setType(i < type.size() ? type.get(i) : "页面无类型描述")
                .setAnonymous(i < anonymous.size() ? anonymous.get(i) : "页面无描述")
                .setLocation(i < location.size() ? location.get(i) : "页面无地址描述")
                .setOperator(i < operator.size() ? operator.get(i) : "页面无运营商描述")
                .setResponseTime(i < responseTime.size() ? responseTime.get(i) : "页面无响应时间描述")
                .setTransmissionTime(i < transmissionTime.size() ? transmissionTime.get(i) : "页面无传输时间描述")
                .setVerificationTime(i < verificationTime.size() ? verificationTime.get(i) : "页面无验证时间描述");
        entityList.add(proxyMiPu);
        LOGGER.info(proxyMiPu.toString());
    }
    miPuService.saveAll(entityList);
}
}
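The pipeline receives one list per column and has to turn them into one object per row. A minimal, library-free sketch of that step is shown below; `ColumnZipSketch` and `Row` are hypothetical stand-ins for the pipeline and the `ProxyMiPu` entity, reduced to three columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ColumnZipSketch {
    /** One table row; a simplified stand-in for the real entity. */
    static final class Row {
        final String ip;
        final Integer port;
        final String type;

        Row(String ip, Integer port, String type) {
            this.ip = ip;
            this.port = port;
            this.type = type;
        }

        @Override
        public String toString() {
            return ip + ":" + port + " (" + type + ")";
        }
    }

    /** Builds one new Row per index; optional columns fall back to a default. */
    static List<Row> zip(List<String> ips, List<Integer> ports,
                         List<String> types, String typeFallback) {
        int n = Math.min(ips.size(), ports.size()); // rows need both ip and port
        List<Row> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            String type = i < types.size() ? types.get(i) : typeFallback;
            rows.add(new Row(ips.get(i), ports.get(i), type)); // fresh object each time
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Row> rows = zip(Arrays.asList("1.1.1.1", "2.2.2.2"),
                Arrays.asList(80, 8080),
                Collections.<String>emptyList(), "n/a");
        rows.forEach(System.out::println);
    }
}
```

Creating the object inside the loop matters: if a single instance were created once and added repeatedly, every list element would point at the same object and show only the last row's values.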
Spring Boot startup class:
package icu.smile;
import icu.smile.proxy.task.MiPuPageProcessor;
import icu.smile.proxy.task.MiPuPipeline;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.ConfigurableApplicationContext;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.selenium.SeleniumDownloader;
import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
import us.codecraft.webmagic.scheduler.QueueScheduler;
/**
* <p>
* Scrapes Mipu Proxy
* </p>
*
* @author starrysky
* @since 2021/6/6
*/
@SpringBootApplication
public class MiPuProxyApplication {
private static String URL = "https://proxy.mimvp.com/freeopen";
public static void main(String[] args) {
final ConfigurableApplicationContext ctx = SpringApplication.run(MiPuProxyApplication.class, args);
System.setProperty("selenuim_config","src/main/resources/config.ini");
Spider.create(new MiPuPageProcessor())
.addUrl(URL)
.setDownloader(new SeleniumDownloader())
.addPipeline(ctx.getBean(MiPuPipeline.class))
.setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(10000000)))
.thread(1).runAsync();
}
}
Resources:
application.yml:
server:
  port: 8849
spring:
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://47.111.237.28:3388/mipuproxy?characterEncoding=utf8&useSSL=false&serverTimezone=Asia/Shanghai
    username: root
    password: root
    type: com.alibaba.druid.pool.DruidDataSource
    # Connection pool settings
    druid:
      # Number of connections created at startup
      initial-size: 5
      # Minimum number of idle connections
      min-idle: 5
      # Maximum number of active connections
      max-active: 20
      # Maximum wait when acquiring a connection, in milliseconds
      max-wait: 60000
      # Interval between checks for idle connections that should be closed, in milliseconds
      time-between-eviction-runs-millis: 60000
      # Minimum time a connection stays idle in the pool before it may be evicted, in milliseconds
      min-evictable-idle-time-millis: 30000
      validation-query: SELECT 1 FROM DUAL
      test-while-idle: true
      test-on-borrow: true
      test-on-return: false
      # Whether to cache PreparedStatements (PSCache). Officially recommended off for MySQL; I suggest enabling it if you want to use the SQL firewall
      pool-prepared-statements: true
      max-pool-prepared-statement-per-connection-size: 20
      # Filters for monitoring statistics; without them the monitoring page cannot collect SQL stats; 'wall' is the SQL firewall
      filter:
        stat:
          merge-sql: true
          slow-sql-millis: 5000
      # Basic monitoring settings
      web-stat-filter:
        enabled: true
        url-pattern: /*
        # URLs excluded from statistics
        exclusions: "*.js,*.gif,*.jpg,*.png,*.css,*.ico,/druid/*"
        session-stat-enable: true
        session-stat-max-count: 100
      stat-view-servlet:
        enabled: true
        url-pattern: /druid/*
        reset-enable: true
        # Login name and password for the monitoring page
        login-username: admin
        login-password: admin
        # IPs allowed to access
        allow: 127.0.0.1
        # IPs denied access
        #deny: 192.168.1.100
      default-auto-commit: true
  jpa:
    database: mysql
    show-sql: true
    open-in-view: false
    hibernate:
      naming:
        physical-strategy: org.hibernate.boot.model.naming.PhysicalNamingStrategyStandardImpl
  devtools:
    restart:
      enabled: true
selenium.ini:
driver=chrome
chrome_exec_path=/usr/local/bin/chromedriver
safari_driver_loglevel=DEBUG
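Note that the startup class points the `selenuim_config` property at `src/main/resources/config.ini`, while the listing here is labeled `selenium.ini`; whichever file name is used, the path passed at startup has to match it. Since the file is plain `key=value` text, a quick sanity check before launching the spider could look like the sketch below. `SeleniumIniSketch` and its `validate` rules are illustrative assumptions, not part of WebMagic.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class SeleniumIniSketch {
    /** Parses key=value text such as the ini file shown above. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns a problem description, or null when the settings look usable. */
    static String validate(Properties props) {
        String driver = props.getProperty("driver");
        if (driver == null || driver.trim().isEmpty()) {
            return "missing 'driver'";
        }
        if ("chrome".equalsIgnoreCase(driver.trim())) {
            String path = props.getProperty("chrome_exec_path");
            if (path == null || path.trim().isEmpty()) {
                return "driver=chrome but 'chrome_exec_path' is missing";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Properties props = parse("driver=chrome\n"
                + "chrome_exec_path=/usr/local/bin/chromedriver\n"
                + "safari_driver_loglevel=DEBUG\n");
        String problem = validate(props);
        System.out.println(problem == null ? "config looks usable" : problem);
    }
}
```

A check like this fails fast with a readable message instead of a driver startup error deep inside the downloader.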