Java: converting PDF to TXT for full-text document search

Amrf, posted 2019/03/12 21:37:10
[Abstract] To do: https://pdfbox.apache.org/ https://stackoverflow.com/questions/18098400/how-to-get-raw-text-from-pdf-file-using-java https://stackoverflow.com/questions/50692771/multiple-pdf-file-to-txt-in-java...

To do (links still to be worked through)

https://pdfbox.apache.org/

https://stackoverflow.com/questions/18098400/how-to-get-raw-text-from-pdf-file-using-java

https://stackoverflow.com/questions/50692771/multiple-pdf-file-to-txt-in-java

https://stackoverflow.com/questions/30570196/how-to-convert-pdf-into-text-file-using-itext-liberary

https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox

https://stackoverflow.com/questions/583615/pdf-to-text-tool-or-java-library

https://stackoverflow.com/questions/17986305/how-can-i-convert-pdf-file-to-word-file-using-java
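As a rough sketch of the batch step those links deal with (many PDF files in, one TXT file per PDF out), the class below walks a directory and writes the extracted text next to a pluggable extractor. The names `PdfBatchToTxt`, `TextExtractor` and `convertAll` are made up for illustration. The PDFBox calls appear only in a comment, and they assume the PDFBox 2.x API (`PDDocument.load`, `PDFTextStripper.getText`); this is not tested against any particular release.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PdfBatchToTxt {

    /** Hypothetical hook: turns one PDF file into plain text. */
    @FunctionalInterface
    public interface TextExtractor {
        String extract(Path pdf) throws IOException;
    }

    /*
     * With PDFBox 2.x on the classpath, an extractor could look like this
     * (assumption: 2.x API; PDFBox 3.x loads documents via Loader.loadPDF):
     *
     *   TextExtractor pdfbox = pdf -> {
     *       try (PDDocument doc = PDDocument.load(pdf.toFile())) {
     *           return new PDFTextStripper().getText(doc);
     *       }
     *   };
     */

    /** Converts every *.pdf directly inside inDir to a same-named *.txt in outDir. */
    public static List<Path> convertAll(Path inDir, Path outDir, TextExtractor extractor)
            throws IOException {
        Files.createDirectories(outDir);

        // Collect the PDF files first so the directory stream is closed early.
        List<Path> pdfs;
        try (Stream<Path> files = Files.list(inDir)) {
            pdfs = files
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }

        List<Path> written = new ArrayList<>();
        for (Path pdf : pdfs) {
            String name = pdf.getFileName().toString();
            String base = name.substring(0, name.length() - ".pdf".length());
            Path txt = outDir.resolve(base + ".txt");
            // UTF-8 output keeps non-ASCII text intact for later indexing.
            Files.write(txt, extractor.extract(pdf).getBytes(StandardCharsets.UTF_8));
            written.add(txt);
        }
        return written;
    }
}
```

Keeping the extractor behind an interface means the same loop would also work with iText or an external tool, which several of the linked questions discuss as alternatives.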


Lucene full-text search

https://www.toptal.com/database/full-text-search-of-dialogues-with-apache-lucene (https://github.com/dougsparling/lucene-testbed)

https://stackoverflow.com/questions/6807701/lucene-full-text-search

https://medium.com/@wkrzywiec/full-text-search-with-hibernate-search-lucene-part-1-e245b889aa8e

(https://github.com/wkrzywiec/Library-Spring/tree/163fbbac65750b199cc665a2ba61fd4b80fc2ff6)

https://blog.csdn.net/forfuture1978/article/details/4711308

https://blog.csdn.net/yerenyuan_pku/article/details/72582979

https://blog.csdn.net/u014704496/article/details/40408387


https://www.baeldung.com/lucene-file-search (https://github.com/eugenp/tutorials/tree/master/lucene)

https://github.com/tantivy-search/tantivy

https://www.wave-access.com/public_en/blog/2014/october/02/full-text-search-by-using-apache-lucene.aspx
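To make the idea behind these links concrete, here is a toy inverted index in plain Java. It is not Lucene and uses none of its API; it only illustrates the core structure a full-text engine maintains (term to documents) and an AND query over it. The class name `TinyInvertedIndex` and its methods are invented for this sketch. The tokenizer simply splits on non-letter, non-digit characters, so it would treat a run of Chinese characters as a single term; a real setup would rely on a proper analyzer, plus ranking, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TinyInvertedIndex {

    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<String>> postings = new HashMap<>();

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Indexes one document, e.g. the text of a converted TXT file. */
    public void add(String docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** AND query: ids of the documents that contain every term of the query. */
    public Set<String> search(String query) {
        TreeSet<String> result = null;
        for (String term : tokenize(query)) {
            TreeSet<String> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy, so the index is not mutated
            } else {
                result.retainAll(docs);         // intersect with the next term
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

In use, each TXT file produced from a PDF would be passed to `add` with its file name as the id, and `search("lucene text")` would return the ids of the files containing both words.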


Extracting the table of contents (outline) from a PDF:

https://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/pdmodel/PDDocument.html

