- <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
- <dependency>
- <groupId>org.apache.tika</groupId>
- <artifactId>tika-core</artifactId>
- <version>1.27</version>
- </dependency>
-
- <dependency>
- <groupId>org.apache.tika</groupId>
- <artifactId>tika-parsers</artifactId>
- <version>1.27</version>
- </dependency>
- // Detect the file's MIME type; returns, for example, text/plain
- Tika tika = new Tika();
- String detect = tika.detect(targetFile);
- System.out.println("Type: "+detect);
-
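- // Parse the document content; handler.toString() returns the extracted text
- AutoDetectParser parser = new AutoDetectParser();
-
- // Write limit in characters (not bytes); size it to the largest document you expect
- BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
- Metadata metadata = new Metadata();
-
- // try-with-resources closes the stream even if parsing fails
- try (InputStream input = new FileInputStream(targetFile)) {
-     parser.parse(input, handler, metadata, new ParseContext());
- }
- System.out.println("Document content: " + handler.toString());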
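If the document is large and the handler is created without an argument, as in new BodyContentHandler(), parsing may fail with the following error: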
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
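As the message indicates, the extracted text exceeded the write limit of 100,000 characters. Because no argument was passed to the constructor, the handler fell back to this default limit.
The fix is to pass an explicit limit, as in new BodyContentHandler(10 * 1024 * 1024), and run the program again; the full document text is then returned. Passing -1 disables the write limit entirely.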
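To make the write-limit behaviour concrete, here is a minimal sketch that uses only the SAX classes shipped with the JDK. It is not Tika's implementation: WriteLimitDemo and LimitedTextHandler are hypothetical names invented for this illustration, and the handler is a simplified model of a text collector that keeps the characters that fit and then throws once the limit is reached. Treating -1 as "no limit" mirrors the BodyContentHandler convention mentioned above.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitDemo {

    /** Hypothetical, simplified stand-in for a write-limited SAX text handler. */
    static class LimitedTextHandler extends DefaultHandler {
        private final int writeLimit; // maximum characters to keep; -1 means unlimited
        private final StringBuilder text = new StringBuilder();

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit < 0 || text.length() + length <= writeLimit) {
                text.append(ch, start, length);
                return;
            }
            // Keep the part that still fits, then signal that the limit was hit.
            int remaining = writeLimit - text.length();
            text.append(ch, start, remaining);
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    public static void main(String[] args) throws SAXException {
        char[] document = "0123456789ABCDEF".toCharArray(); // 16 characters

        // A limit smaller than the document: parsing fails, partial text survives.
        LimitedTextHandler small = new LimitedTextHandler(10);
        try {
            small.characters(document, 0, document.length);
        } catch (SAXException e) {
            System.out.println("Failed: " + e.getMessage());
        }
        System.out.println("Partial text: " + small); // prints "Partial text: 0123456789"

        // A limit larger than the document: the full text is collected.
        LimitedTextHandler large = new LimitedTextHandler(100);
        large.characters(document, 0, document.length);
        System.out.println("Full text: " + large); // prints "Full text: 0123456789ABCDEF"
    }
}
```

The same trade-off applies to the real handler: a generous or disabled limit avoids the exception, but the whole extracted text is then held in memory, so the limit is best chosen with the largest expected document in mind.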