
[Apache Tika] Parsing Documents in a Spring Boot Project


1. Add the dependencies

  <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
  <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.27</version>
  </dependency>
  <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.27</version>
  </dependency>

2. Test code (the file I read is a .dat log file)

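  import java.io.FileInputStream;
  import java.io.InputStream;
  import org.apache.tika.Tika;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  // Detect the file type; for my .dat log file this returns text/plain
  Tika tika = new Tika();
  String detect = tika.detect(targetFile);
  System.out.println("Type: " + detect);

  // Parse the document content
  AutoDetectParser parser = new AutoDetectParser();
  // The argument is the write limit in characters; size it to the document
  BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
  Metadata metadata = new Metadata();
  // try-with-resources closes the stream even if parsing fails
  try (InputStream input = new FileInputStream(targetFile)) {
      parser.parse(input, handler, metadata, new ParseContext());
  }
  System.out.println("Document content: " + handler.toString());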

3. Note

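If the document is large and you call new BodyContentHandler() without an argument, parsing may fail with:
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
As the message says, the extracted text exceeded the write limit of 100,000 characters. Because no argument was passed, the handler fell back to its default limit.
Passing an explicit limit, as in new BodyContentHandler(10 * 1024 * 1024), and running the program again returns the full document text. The limit is counted in characters, so size it to the largest document you expect; passing -1 disables the limit altogether.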
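
To make the "Text up to the limit is however available" part of the message concrete, here is a minimal sketch of how a write-limited SAX handler can behave. It uses only the JDK's org.xml.sax classes. LimitedTextHandler is a hypothetical stand-in written for illustration, not Tika's actual implementation; the assumption is that Tika's handler follows roughly the same pattern of keeping what fits and then throwing.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical, simplified stand-in for a write-limited text handler (not Tika code). */
public class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit; // maximum characters to keep; -1 means unlimited

    public LimitedTextHandler(int writeLimit) {
        if (writeLimit < -1) {
            throw new IllegalArgumentException("writeLimit must be -1 or non-negative");
        }
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit == -1 || text.length() + length <= writeLimit) {
            text.append(ch, start, length);
            return;
        }
        // Keep whatever still fits, then signal that the limit was hit.
        text.append(ch, start, writeLimit - text.length());
        throw new SAXException("write limit of " + writeLimit + " characters reached");
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public static void main(String[] args) {
        LimitedTextHandler handler = new LimitedTextHandler(5);
        char[] chunk = "hello world".toCharArray();
        try {
            handler.characters(chunk, 0, chunk.length);
        } catch (SAXException e) {
            System.out.println("Stopped early: " + e.getMessage());
        }
        // The text collected before the exception is still readable.
        System.out.println("Partial text: " + handler);
    }
}
```

The same idea applies to the test code above: if parsing stops because the limit was reached, handler.toString() still holds the text collected so far, so catching the exception and using the partial text is an alternative to raising the limit.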
