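Apache Tika is an open-source document parsing toolkit from Apache. It can parse and extract the content and format information of more than a thousand file types (such as PPT, XLS, and PDF). Tika can be used in several ways: through a graphical interface (tika-app), as a standalone deployment called through its HTTP API (tika-server), or as a library embedded in your own project.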
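For comparison with the embedded approach, the standalone tika-server mode can be sketched with the JDK's built-in HTTP client. This is a minimal sketch, not the article's own code: it assumes a tika-server instance is reachable at the base URL you pass in (the server listens on port 9998 by default) and uses its `PUT /tika` endpoint, which returns the extracted text. The class name `TikaServerClient` and its method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: sends a document to a running tika-server and returns its text.
final class TikaServerClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI endpoint;

    // baseUrl is an assumption, e.g. "http://localhost:9998" for a local default install.
    TikaServerClient(String baseUrl) {
        this.endpoint = URI.create(baseUrl).resolve("/tika");
    }

    // Builds the PUT request; the raw document bytes are the request body.
    HttpRequest buildRequest(byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain") // ask for plain text instead of XHTML
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the request and returns the extracted text, failing on any non-200 status.
    String extractText(byte[] document) throws IOException, InterruptedException {
        HttpResponse<String> response =
                http.send(buildRequest(document), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

Calling `new TikaServerClient("http://localhost:9998").extractText(bytes)` would then return the text of the document, provided the server is running. The embedded approach avoids this network hop but puts the parser libraries on your own classpath.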
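This article demonstrates the third approach: adding Tika to a Spring Boot project to parse documents.
First, add the following dependencies to the project's pom.xml: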
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-bom</artifactId>
            <version>2.8.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- Versions are managed by the tika-bom import above -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
    </dependency>
</dependencies>
Next, place a tika-config.xml file in the resources directory. Its contents are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <encodingDetectors>
        <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
            <params>
                <param name="markLimit" type="int">64000</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
            <params>
                <param name="markLimit" type="int">64001</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
            <params>
                <param name="markLimit" type="int">64002</param>
            </params>
        </encodingDetector>
    </encodingDetectors>
</properties>
MyTikaConfig
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;

/**
 * Tika configuration class.
 */
@Configuration
public class MyTikaConfig {

    @Autowired
    private ResourceLoader resourceLoader;

    @Bean
    public Tika tika() throws TikaException, IOException, SAXException {
        // Load tika-config.xml from the classpath (src/main/resources).
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
        // try-with-resources closes the stream once TikaConfig has read it.
        try (InputStream inputStream = resource.getInputStream()) {
            TikaConfig config = new TikaConfig(inputStream);
            Detector detector = config.getDetector();
            Parser autoDetectParser = new AutoDetectParser(config);
            // Build the Tika facade from the configured detector and parser.
            return new Tika(detector, autoDetectParser);
        }
    }
}
Once this configuration is in place, the Tika bean can be injected and used wherever the project needs it.
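A minimal sketch of such an injection point might look like the following. The class `DocumentTextService` and its method names are hypothetical and not part of the original article. It assumes the `Tika` bean defined in `MyTikaConfig` is available and that the parser dependencies listed earlier are on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

// Hypothetical service that receives the Tika bean through constructor injection.
@Service
class DocumentTextService {

    private final Tika tika;

    // Spring passes in the Tika bean declared in MyTikaConfig.
    DocumentTextService(Tika tika) {
        this.tika = tika;
    }

    // Extracts the plain-text content of a document; Tika closes the stream when done.
    String extractText(InputStream document) throws IOException, TikaException {
        return tika.parseToString(document);
    }

    // Detects the media type of a document, e.g. "application/pdf".
    String detectType(InputStream document) throws IOException {
        return tika.detect(document);
    }
}
```

A controller could pass `MultipartFile.getInputStream()` from an upload straight into `extractText`. Note that `parseToString` limits its output to 100,000 characters by default; calling `tika.setMaxStringLength(-1)` on the bean removes that limit if full text is required.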