Extracting text from an image with Java
Test code:
import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class ImageCracker {

    public static String crackImage(String filePath) {
        File imageFile = new File(filePath);
        ITesseract instance = new Tesseract();
        // Directory that holds the *.traineddata files
        instance.setDatapath("D:\\Users\\DetectText\\tessdata");
        try {
            return instance.doOCR(imageFile);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
            return "Error while reading image";
        }
    }

    public static void main(String[] args) {
        System.out.println(ImageCracker.crackImage("D:\\Users\\DetectText\\tessdata\\captcha1.png"));
    }
}
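I tried it on the blog's captchas, and it recognized almost nothing.
Evidently it will not work without training data tailored to the target.
Remove the distractor text and distractor polylines from the sample, keeping only the red text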
BufferedImage initImage;
try {
    initImage = ImageIO.read(new File(filePath));
    int width = initImage.getWidth(null), height = initImage.getHeight(null);
    // Copy into an RGB image so every pixel can be read and overwritten
    BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
    Graphics g = image.getGraphics();
    g.drawImage(initImage, 0, 0, null);
    g.dispose();
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int pixel = image.getRGB(x, y);
            Color color = new Color(pixel);
            // Paint every pixel whose red channel is below 235 white
            if (color.getRed() < 235) {
                image.setRGB(x, y, 0xffffff);
            }
        }
    }
    ImageIO.write(image, "png", new File("D:\\Users\\DetectText\\tessdata\\captcha_1.png"));
} catch (IOException e1) {
    e1.printStackTrace();
}
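One caveat about the filter above: a test on the red channel alone also lets white, yellow, and magenta pixels through, because they all have a high red value. The following is a minimal sketch of a stricter variant that additionally requires low green and blue. The class name `RedTextFilter` and the threshold `MAX_OTHER` are hypothetical and not from the original test; the right values depend on the captcha being cleaned.

```java
import java.awt.image.BufferedImage;

final class RedTextFilter {
    // Assumed thresholds for illustration only; tune them per captcha.
    static final int MIN_RED = 235;   // same cut-off as the snippet above
    static final int MAX_OTHER = 100; // hypothetical upper bound for green and blue

    /** Returns a copy of src in which every pixel that is not strongly red is painted white. */
    static BufferedImage keepRedText(BufferedImage src) {
        int width = src.getWidth();
        int height = src.getHeight();
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // "Red" here means a high red channel AND low green and blue channels
                boolean isRed = r >= MIN_RED && g <= MAX_OTHER && b <= MAX_OTHER;
                out.setRGB(x, y, isRed ? rgb : 0xFFFFFF);
            }
        }
        return out;
    }
}
```

It works on an in-memory image, so it can be dropped in between `ImageIO.read` and `ImageIO.write` in the snippet above.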
Use jTessBoxEditor to edit the boxes and generate a training set
Reference: https://zhuanlan.zhihu.com/p/57826761
Test and verify with the newly generated training set
File imageFile = new File(filePath);
ITesseract instance = new Tesseract();
instance.setDatapath("D:\\Users\\DetectText\\tessdata");
instance.setTessVariable("tessedit_char_whitelist", "g"); //0123456789abcdefghijklmnopqrstuvwxyz
//instance.setTessVariable("editor_image_text_color", "RED");
//instance.setPageSegMode(7);
String result = instance.doOCR(imageFile);
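--I tested only a single letter, and with the training set and the test set exactly identical;
I suspect that identifying the font used in the image first and then working from there would give fairly high accuracy, without having to prepare a large training set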
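The character whitelist used in the test above can also be applied on the Java side to whatever string the OCR call returns, which may carry trailing newlines or stray symbols. The following is a minimal sketch under that assumption; `OcrResultCleaner` and `clean` are hypothetical names, not part of Tess4J.

```java
final class OcrResultCleaner {
    // The full character set mentioned in the comment of the test code above
    static final String WHITELIST = "0123456789abcdefghijklmnopqrstuvwxyz";

    /** Keeps only the characters of raw that appear in whitelist, in their original order. */
    static String clean(String raw, String whitelist) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (char c : raw.toCharArray()) {
            if (whitelist.indexOf(c) >= 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For example, `OcrResultCleaner.clean(result, OcrResultCleaner.WHITELIST)` could be applied to the `result` string before comparing it with the expected captcha text.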
References:
https://dzone.com/articles/reading-text-from-images-using-java-1
https://github.com/csanuragjain/extra/tree/master/ReadFromImages
https://stackoverflow.com/questions/18095708/tess4j-doesnt-use-its-tessdata-folder
https://github.com/tesseract-ocr/tessdata
//========================
Tesseract-OCR-04: Training with jTessBoxEditor
limit-characters-tesseract-is-looking-for
What level has OCR technology reached today? If an image is too blurry for the human eye to read, can OCR still recognize it?
write-with-opencv-ocr-tessdata
tess4j-set-only-to-identify-numbers-and-letters
recognize-coloured-text-with-tesseract-tess4j
using-gpuimages-adaptive-threshold-filter
tesseract-not-picking-up-different-colored-text
Java- How to remove background color from an image
Inconsistent error message when eng.traineddata not found
tesseract-for-java-setting-tessdata-prefix-for-executable-jar
Tesseract OCR training gives 'APPLY_BOXES' errors
Trained Tesseract on 瘦金体 successfully!!
//========================
jTessBoxEditor has to be built with NetBeans; below is a NetBeans 9 Windows mirror
ant:jtessboxeditor junit error
//========================
fatal-error-in-launcher-unable-to-create-process-using-c-program-files-x86
importerror-no-module-named-pil
multi-python-version-on-windows
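After moving the Python directory off the C: drive, the pip3 install command fails. Installing with python -m pip works instead; I have not yet found a fix for the pip3 "Unable to create process using" error.
python -m pip install Pillow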
//========================
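For system environment variables, use set Path=C: and then open a new console; for user environment variables, restart explorer.exe; to add a directory temporarily, run SET PATH=%PATH%;C:\xxxx directly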
adding-directory-to-path-environment-variable-in-windows
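/*---------------------------divider--------------------------------------------*/
The approach above still did not give very accurate matches; matching on local pattern features might work,
for g, the upper part can serve as a recognition feature as long as it is not occluded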
//==========
When using template matching, skipping the call to Core.normalize(result, result, 0, 1, Core.NORM_MINMAX, -1, new Mat()) and using a method such as Imgproc.TM_CCOEFF_NORMED directly gives a much clearer difference in scores;
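The reason is that min-max normalization rescales every result map so that its best value becomes 1, which makes a poor best match look the same as a perfect one. An already normalized score such as the one behind TM_CCOEFF_NORMED lies in [-1, 1] and stays comparable across images. The following pure-Java sketch illustrates that formula (zero-mean normalized cross-correlation) on grayscale arrays. It is an illustration only, not OpenCV's implementation, and `NormalizedCorrelation`, `score`, and `bestMatch` are hypothetical names.

```java
final class NormalizedCorrelation {

    /** Zero-mean normalized cross-correlation between templ and the patch of image at (ox, oy). */
    static double score(double[][] image, double[][] templ, int ox, int oy) {
        int h = templ.length, w = templ[0].length;
        double meanT = 0, meanI = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                meanT += templ[y][x];
                meanI += image[oy + y][ox + x];
            }
        }
        meanT /= (w * h);
        meanI /= (w * h);
        double num = 0, varT = 0, varI = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                double dt = templ[y][x] - meanT;
                double di = image[oy + y][ox + x] - meanI;
                num += dt * di;
                varT += dt * dt;
                varI += di * di;
            }
        }
        double denom = Math.sqrt(varT * varI);
        // A flat patch or flat template carries no pattern; report 0 instead of dividing by zero
        return denom == 0 ? 0 : num / denom;
    }

    /** Slides templ over image and returns {x, y, score} of the highest-scoring position. */
    static double[] bestMatch(double[][] image, double[][] templ) {
        double[] best = {0, 0, Double.NEGATIVE_INFINITY};
        for (int oy = 0; oy + templ.length <= image.length; oy++) {
            for (int ox = 0; ox + templ[0].length <= image[0].length; ox++) {
                double s = score(image, templ, ox, oy);
                if (s > best[2]) {
                    best = new double[] {ox, oy, s};
                }
            }
        }
        return best;
    }
}
```

Because the best score is absolute, it can be compared against a fixed threshold to reject images that do not contain the template at all, which is not possible once the map has been min-max normalized.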
/*---------------------------divider--------------------------------------------*/
I looked at CAPTCHA-caffe and it seems like a very good approach; I plan to test it;
//========================
building-a-java-edge-detection-application
org.opencv.imgproc.Imgproc Canny
//========================
detecting-image-in-another-image-image-comparison
Finding pattern of points in an another image (OpenCV)
unsatisfiedlinkerror-no-opencv-java249-in-java-library-path
Java OpenCV: after calling Imgproc.matchTemplate, how do I check the result?
//========================
multi-scale-template-matching-using-python-openc
scale-and-rotation-invariant-template-matching
how-does-macthtemplate-deal-with-scaling
OpenCV Exploration (9): Template Matching if (minVal < 0.001)
//========================
OpenCV-Python Hough line detection: HoughLinesP function parameters
//========================
//========================
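[OpenCV Beginner Tutorial 11] Morphological Image Processing (2): Opening, Closing, Morphological Gradient, Top Hat, and Black Hat - [浅墨's Game Programming Blog] 毛星云 (浅墨)'s column - CSDN blog
OpenCV contour matching - vine_branches's column - CSDN blog
Learning OpenCV 2.4: Binarization (2) threshold - dieju8330's blog - CSDN blog
Growing with OpenCV (8): Extracting and Describing Lines and Contours - ☆Ronny丶 - cnblogs
OpenCV-Python study notes 6: adding borders with copyMakeBorder, bitwise operations bitwise_*, color spaces cvtColor - Jianshu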
//==================
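PHP captcha[, PHP check code][, PHP verification code][, generating a captcha in PHP][, getting a captcha in PHP] - barack-毛巴马's personal space - OSCHINA
10-1 Captcha generation + 10-2 Generating tfrecord - Josie_chen - cnblogs
Generating graphical captchas in ASP.NET - JustXIII - cnblogs
[Repost] Generating image captchas in Java - shindoyang - cnblogs
[Captcha Generation and Cracking] Part 1: Captcha generation and writing captcha images to TFRecord - 寸先生的AI道路 - CSDN blog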
//=================
https://blog.csdn.net/sinat_14916279/article/details/56489601
https://github.com/LouieYang/CAPTCHA-caffe
/*--------------------------divider-------------------------------------*/
Test log:
Add two packages on top of the base caffe environment:
python -m pip install captcha
python -m pip install h5py
This test project targets a Python 2 environment, while my caffe environment is Python 3, so it needs some adaptation;
python generator/generator.py --ntrain 200 --nval 100
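generator.py:
change the print statements to Python 3 syntax
json_content.iteritems() => json_content.items()
add two imports
from PIL import Image
import h5py
change the argument of im.resize to a tuple
in solver.prototxt, change net to be relative to the directory the command is run from, i.e. prepend model/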
caffe train --solver=model/solver.prototxt
References:
https://github.com/apache/incubator-mxnet/issues/3205
https://stackoverflow.com/questions/30418481/error-dict-object-has-no-attribute-iteritems/30418498
https://stackoverflow.com/questions/20443846/python-pil-nameerror-global-name-image-is-not-defined