Parsing Word files with Tika

Author: 小蓝xlanll | 2024-06-03 02:20:12


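Parsing Word with Apache Tika. Tika delegates Word parsing to Apache POI, so the notes below use the POI APIs directly.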

Apache POI - HWPF and XWPF - Java API to Handle Microsoft Word Files

http://poi.apache.org/document/

http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.poi/poi-scratchpad/3.7

http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.7

Parsing .doc files

Requires poi-scratchpad-3.7.jar

POI-HWPF - A Quick Guide

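Basic text extraction

Use org.apache.poi.hwpf.extractor.WordExtractor, which can be constructed from either an InputStream or an HWPFDocument.

getText() returns the text of the whole document.

getParagraphText() returns the text paragraph by paragraph, as a String array.

getTextFromPieces() reads the text straight from the document's text pieces. It is fast, but it does not return text page by page, and the result can include content that is not visible page text.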

Extracting specific text properties

To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.

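Step 1: create an HWPFDocument

Step 2: get the Range

getRange(): Returns the range which covers the whole of the document, but excludes any headers and footers.

int numParagraphs() Used to get the number of paragraphs in a range.

int numSections() Used to get the number of sections in a range (a "section" here is the Word section created through Insert > Break).

Step 3: get the paragraphs

getParagraph(int index): returns the paragraph at the given index within the range.

text(): returns the text of a paragraph or character run. In HWPF this method is named text(), not getText().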

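import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// Class name chosen here for illustration; the original snippet showed only main().
public class ParseWordDocTest {

    public static void main(String[] args) throws Exception {
        // HWPF reads only the binary .doc format, so the input must be a .doc
        // file (the original snippet pointed at 1.docx, which HWPFDocument rejects).
        try (InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.doc")) {
            HWPFDocument doc = new HWPFDocument(istream);
            // The range covers the whole document, excluding headers and footers.
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph poiPara = range.getParagraph(i);
                // Visit every character run (a span of uniformly formatted text).
                for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
                    CharacterRun run = poiPara.getCharacterRun(j);
                    System.out.println("Color " + run.getColor()); // color
                    System.out.println("Font size " + run.getFontSize()); // font size, in half-points
                    System.out.println("Font Name " + run.getFontName()); // font name
                    System.out.println(run.isBold() + " " + run.isItalic() + " "
                            + run.getUnderlineCode()); // bold, italic, underline
                    System.out.println("Text is " + run.text()); // text content
                }
            }
        }
    }
}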
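The values printed by the loop above are raw: HWPF's CharacterRun.getFontSize() reports the size in half-points, and the text returned by text() normally still carries Word's control characters, such as the trailing paragraph mark. A minimal sketch of two helpers for tidying these values follows. The class HwpfRunValues and its method names are hypothetical and not part of POI, and stripping the table cell mark (character 7) is an assumption about what you want removed.

```java
// Hypothetical helper, not part of POI: class and method names are illustrative.
final class HwpfRunValues {
    private HwpfRunValues() {}

    /** Converts a font size in half-points (for example 21) to points (10.5). */
    static double halfPointsToPoints(int halfPoints) {
        return halfPoints / 2.0;
    }

    /**
     * Removes trailing carriage returns (paragraph marks) and, by assumption,
     * trailing table cell marks (character 7) from raw run or paragraph text.
     */
    static String stripTrailingMarks(String rawText) {
        int end = rawText.length();
        while (end > 0) {
            char c = rawText.charAt(end - 1);
            if (c == '\r' || c == (char) 7) {
                end--;
            } else {
                break;
            }
        }
        return rawText.substring(0, end);
    }
}
```

Inside the loop, these could be applied as HwpfRunValues.halfPointsToPoints(run.getFontSize()) and HwpfRunValues.stripTrailingMarks(run.text()).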

Parsing .docx files

Requires poi-ooxml-3.7.jar

http://poi.apache.org/document/quick-guide-xwpf.html

package test;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class ParseWordDocxTest {

    /**
     * Prints the formatting and text of every run in a .docx file.
     *
     * @param args unused
     * @throws Exception if the file cannot be opened or parsed
     */
    public static void main(String[] args) throws Exception {
        // XWPF reads only the OOXML .docx format.
        try (InputStream istream = new FileInputStream(
                "e:\\Users\\ywf\\Desktop\\文本校对\\1.docx")) {
            XWPFDocument docx = new XWPFDocument(istream);
            List<XWPFParagraph> paragraphs = docx.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                // A run is a span of text that shares one set of formatting.
                List<XWPFRun> runs = para.getRuns();
                for (XWPFRun r : runs) {
                    System.out.println("Font color: " + r.getColor());
                    System.out.println("Font name: " + r.getFontFamily());
                    System.out.println("Font size: " + r.getFontSize());
                    // Position 0 is the first text element of the run.
                    System.out.println("Text: " + r.getText(0));
                    System.out.println("Bold? " + r.isBold());
                    System.out.println("Italic? " + r.isItalic());
                }
            }
        }
    }

}
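HWPFDocument reads only the binary .doc format and XWPFDocument reads only .docx, so handing a file to the wrong class fails. When the extension cannot be trusted, one option is to look at the first bytes of the file: a .doc file is an OLE2 compound file that starts with the signature D0 CF 11 E0 A1 B1 1A E1, while a .docx file is a ZIP archive that starts with "PK" followed by 03 04. The sketch below is a minimal illustration using only the JDK. The class WordContainerSniffer and all of its names are hypothetical and not part of POI. Note that the signature identifies only the container: .xls and .ppt are also OLE2 files, and any ZIP archive matches the second check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of POI: names are illustrative.
final class WordContainerSniffer {
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound file (container of .doc).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive (container of .docx).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    private WordContainerSniffer() {}

    /** Classifies a file by its leading bytes. */
    static Container sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    /** Peeks at up to 8 bytes of a mark-capable stream, then rewinds it. */
    static Container sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8);
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream before 8 bytes
            n += r;
        }
        in.reset();
        return sniff(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] head, int[] magic) {
        if (head.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((head[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A FileInputStream does not support mark/reset, so it would need to be wrapped in a java.io.BufferedInputStream before being passed to sniff(InputStream). The same buffered stream can then be handed to HWPFDocument for an OLE2 result or to XWPFDocument for a ZIP result.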

Reposted from: https://www.cnblogs.com/yuwenfeng/p/3624937.html
