Extracting text from images with Java

Posted by Amrf on 2019/10/30 21:24:51

Summary: https://dzone.com/articles/reading-text-from-images-using-java-1 https://github.com/csanuragjain/extra/tree/master/ReadFromImages https://stackoverflow.com/questions/18095708/t...

Test code:

import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class ImageCracker {

    // Runs Tesseract OCR on the image at filePath and returns the recognized text.
    public static String crackImage(String filePath) {
        File imageFile = new File(filePath);
        ITesseract instance = new Tesseract();

        // Directory that holds the traineddata files.
        instance.setDatapath("D:\\Users\\DetectText\\tessdata");

        try {
            return instance.doOCR(imageFile);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
            return "Error while reading image";
        }
    }

    public static void main(String[] args) {
        System.out.println(ImageCracker.crackImage("D:\\Users\\DetectText\\tessdata\\captcha1.png"));
    }
}

I tried it on this blog's CAPTCHA, and it recognized almost nothing.

It looks like this will not work without training data tailored to the target images.

First, strip the distracting text and the interference lines from the sample, keeping only the red text:

// Requires: java.awt.Color, java.awt.Graphics, java.awt.image.BufferedImage,
//           java.io.File, java.io.IOException, javax.imageio.ImageIO
try {
    BufferedImage initImage = ImageIO.read(new File(filePath));
    int width = initImage.getWidth();
    int height = initImage.getHeight();
    // Copy into an RGB image so every pixel can be read and rewritten uniformly.
    BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
    Graphics g = image.getGraphics();
    g.drawImage(initImage, 0, 0, null);
    g.dispose();
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            Color color = new Color(image.getRGB(x, y));
            // Pixels whose red channel is below 235 are treated as noise and painted white.
            if (color.getRed() < 235) {
                image.setRGB(x, y, 0xffffff);
            }
        }
    }
    ImageIO.write(image, "png", new File("D:\\Users\\DetectText\\tessdata\\captcha_1.png"));
} catch (IOException e1) {
    e1.printStackTrace();
}
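
The `getRed() < 235` check above keeps every pixel with a high red channel, which includes white and other bright colors as well as red. Below is a minimal sketch of a stricter mask, assuming the CAPTCHA text is close to pure red: a pixel is kept only when red is high and both green and blue are low, and the result is written as black text on a white background. The names `RedTextMask` and `keepRedText` and the thresholds 200 and 100 are illustrative assumptions, not values from the original test, and would need tuning on real samples.

```java
import java.awt.image.BufferedImage;

class RedTextMask {
    // Hypothetical thresholds; tune them against real samples.
    static final int MIN_RED = 200;
    static final int MAX_OTHER = 100;

    // Returns a new image in which strongly red pixels become black
    // and every other pixel becomes white.
    static BufferedImage keepRedText(BufferedImage src) {
        int width = src.getWidth();
        int height = src.getHeight();
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xff;
                int g = (rgb >> 8) & 0xff;
                int b = rgb & 0xff;
                boolean isRed = r >= MIN_RED && g <= MAX_OTHER && b <= MAX_OTHER;
                out.setRGB(x, y, isRed ? 0x000000 : 0xffffff);
            }
        }
        return out;
    }
}
```

The returned image can be saved with `ImageIO.write` and passed to `doOCR` in the same way as the filtered image above.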

Use jTessBoxEditor to edit the box files and generate the training data.

Reference: https://zhuanlan.zhihu.com/p/57826761

Then test with the newly generated training data:

File imageFile = new File(filePath);
ITesseract instance = new Tesseract(); 

instance.setDatapath("D:\\Users\\DetectText\\tessdata");
// Only the letter "g" is whitelisted for this test; the full set would be 0123456789abcdefghijklmnopqrstuvwxyz
instance.setTessVariable("tessedit_char_whitelist", "g");
//instance.setTessVariable("editor_image_text_color", "RED");
//instance.setPageSegMode(7);
String result = instance.doOCR(imageFile);
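
Some Tesseract versions reportedly ignore `tessedit_char_whitelist` in LSTM mode, so it can help to filter the returned string in plain Java as well. Below is a minimal sketch, assuming the CAPTCHA alphabet is the digits and lowercase letters listed in the whitelist comment above and that case does not matter. `OcrResultFilter` and `keepAllowed` are hypothetical names and are not part of Tess4J.

```java
import java.util.Locale;

class OcrResultFilter {
    // Assumed CAPTCHA alphabet, taken from the whitelist comment in the snippet above.
    static final String ALLOWED = "0123456789abcdefghijklmnopqrstuvwxyz";

    // Lowercases the raw OCR output and drops every character outside ALLOWED,
    // including any whitespace or trailing newline in the result.
    static String keepAllowed(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ALLOWED.indexOf(c) >= 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

A call such as `OcrResultFilter.keepAllowed(result)` would then be applied to the string returned by `doOCR`.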

Note that I only tested a single letter, and the test set was exactly the same as the training set.
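
Because the test image was also the training image, this result does not show how well the new training data generalizes. Below is a rough sketch of measuring accuracy on samples kept out of training. The recognizer is passed in as a `Function<String, String>`, so the harness does not depend on any OCR library. `OcrAccuracy`, `exactMatchRate`, and the map from image path to expected text are all hypothetical.

```java
import java.util.Map;
import java.util.function.Function;

class OcrAccuracy {
    // labeled: image path -> expected text, for images NOT used in training.
    // recognizer: any function from image path to recognized text.
    // Returns the fraction of images whose trimmed output equals the label,
    // or 0.0 when there are no samples.
    static double exactMatchRate(Map<String, String> labeled,
                                 Function<String, String> recognizer) {
        if (labeled.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (Map.Entry<String, String> e : labeled.entrySet()) {
            String actual = recognizer.apply(e.getKey());
            if (actual != null && actual.trim().equals(e.getValue())) {
                correct++;
            }
        }
        return (double) correct / labeled.size();
    }
}
```

With the test class from earlier, the call could look like `OcrAccuracy.exactMatchRate(samples, ImageCracker::crackImage)`, where `samples` is a hand-labeled map that you would have to prepare yourself.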

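My feeling is that identifying the font used in the image first, and then working from that, would give better accuracy without having to prepare a large training set.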