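If you need to extract text content from many different file types on Linux, doctotext is a good open-source library.
GitHub: https://github.com/tokgolich/doctotext
That said, the original maintainers stopped updating the library years ago, and if you read the source closely you will find a number of TODOs, so there is still plenty of room for improvement.
A caveat: I am not deeply familiar with all of these file formats. This post only describes the problems I ran into, along with my own opinions, mainly so that others can avoid the same pitfalls. If you know these formats well and spot mistakes, corrections are welcome.
First, about the part of the source I have read: I mainly looked at the overall document-parsing flow and did not dig into how each individual format is parsed. The format I know best is OFD (and even there I was modifying someone else's source code).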
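The parsing flow in the source can be understood as follows:
Image source: https://www.it610.com/article/1936051.htm
I think it summarizes the flow well.
If you want the details of the parsing flow, read plain_text_extractor.cpp and plain_text_extractor.h.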
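Next, the problems I ran into in actual development:
1. doctotext's RTF parsing is far from perfect.
2. Encrypted DOC files (where the encryption is not content encryption) cause excessive memory usage.
Both showed up while testing against a large corpus (about 4 GB of test files of all kinds of types).
For the first problem, you may need to find another open-source RTF parser to replace doctotext's RTF parsing. Similarly, DOC files with embedded files are not parsed very well either.
For the second problem:
By an encrypted file I mean the following:
What happens when doctotext parses such a file? Below are the stack traces from the failures on each platform.
Reproduced by debugging on the command line (./doctotext --type fullpath):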
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff5e4051a in __GI_abort () at abort.c:89
#2  0x00007ffff675707d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff6755046 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff6755091 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff67552a9 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff6754032 in __cxa_throw_bad_array_new_length () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff737a470 in ?? () from ./libwv2.so.4
----------------------------------------------------------------
// Cause: the low-level UString constructor uses the size passed in as the allocation size, and on fd_x86 this size is negative, which triggers the error (the negative value was found by adding log output to the source).
#8  0x00007ffff737a5f5 in wvWare::UString::UString(wvWare::UChar const*, int) () from ./libwv2.so.4
#9  0x00007ffff7302e26 in wvWare::Word97::FFN::read(wvWare::OLEStreamReader*, wvWare::Word97::FFN::Version, bool) () from ./libwv2.so.4
#10 0x00007ffff7302a7a in wvWare::Word97::FFN::FFN(wvWare::OLEStreamReader*, wvWare::Word97::FFN::Version, bool) () from ./libwv2.so.4
#11 0x00007ffff738c5c7 in wvWare::FontCollection::FontCollection(wvWare::OLEStreamReader*, wvWare::Word97::FIB const&) () from ./libwv2.so.4
#12 0x00007ffff736eda8 in wvWare::Parser9x::init() () from ./libwv2.so.4
#13 0x00007ffff736e0d8 in wvWare::Parser9x::Parser9x(wvWare::OLEStorage*, wvWare::OLEStreamReader*, wvWare::Word97::FIB const&) () from ./libwv2.so.4
#14 0x00007ffff7379243 in wvWare::Parser97::Parser97(wvWare::OLEStorage*, wvWare::OLEStreamReader*) () from ./libwv2.so.4
#15 0x00007ffff737985a in ?? () from ./libwv2.so.4
#16 0x00007ffff7379d04 in wvWare::ParserFactory::createParser(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from ./libwv2.so.4
#17 0x000055555556ae34 in ?? ()
#18 0x00005555555c8512 in ?? ()
#19 0x00005555555ca0e1 in ?? ()
#20 0x0000555555566b4d in main ()
================================= fd_x86_64
The trace above is from the Fangde (方得) x86_64 platform.

(gdb) bt
#0  0x0000fffff7cc19cc in wvWare::Word97::LFO::clear() () at ./libwv2.so.4
#1  0x0000fffff7cc17bc in wvWare::Word97::LFO::LFO(wvWare::OLEStreamReader*, bool) () at ./libwv2.so.4
#2  0x0000fffff7d2ccd4 in wvWare::ListFormatOverride::ListFormatOverride(wvWare::OLEStreamReader*) () at ./libwv2.so.4
#3  0x0000fffff7d2def4 in wvWare::ListInfoProvider::readListFormatOverride(wvWare::OLEStreamReader*) () at ./libwv2.so.4
#4  0x0000fffff7d2d7c8 in wvWare::ListInfoProvider::ListInfoProvider(wvWare::OLEStreamReader*, wvWare::Word97::FIB const&, wvWare::StyleSheet const*) () at ./libwv2.so.4
#5  0x0000fffff7d0c234 in wvWare::Parser9x::init() () at ./libwv2.so.4
#6  0x0000fffff7d0b530 in wvWare::Parser9x::Parser9x(wvWare::OLEStorage*, wvWare::OLEStreamReader*, wvWare::Word97::FIB const&) () at ./libwv2.so.4
#7  0x0000fffff7d16408 in wvWare::Parser97::Parser97(wvWare::OLEStorage*, wvWare::OLEStreamReader*) () at ./libwv2.so.4
#8  0x0000fffff7d16b3c in () at ./libwv2.so.4
#9  0x0000fffff7d16fc0 in wvWare::ParserFactory::createParser(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () at ./libwv2.so.4
#10 0x000000000040cb44 in ()
#11 0x0000000000462568 in ()
#12 0x0000000000463a94 in ()
#13 0x0000000000409248 in main ()

(gdb) bt
#0  0x0000fffff76aa1f0 in __GI___libc_write (fd=1, buf=buf@entry=0x529dc0, nbytes=nbytes@entry=15) at ../sysdeps/unix/sysv/linux/write.c:26
#1  0x0000fffff7656788 in _IO_new_file_write (f=0xfffff7754588 <_IO_2_1_stdout_>, data=0x529dc0, n=15) at fileops.c:1185
#2  0x0000fffff7655b88 in new_do_write (fp=0xfffff7754588 <_IO_2_1_stdout_>, data=0x529dc0 "U32 ret= 65535\nadU32()=912810866\n", to_do=to_do@entry=15) at libioP.h:839
#3  0x0000fffff7657830 in _IO_new_do_write (to_do=15, data=<optimized out>, fp=<optimized out>) at fileops.c:430
#4  0x0000fffff7657830 in _IO_new_do_write (fp=fp@entry=0xfffff7754588 <_IO_2_1_stdout_>, data=<optimized out>, to_do=15) at fileops.c:430
#5  0x0000fffff7657c78 in _IO_new_file_overflow (f=0xfffff7754588 <_IO_2_1_stdout_>, ch=10) at fileops.c:791
#6  0x0000fffff79455c8 in std::ostream::put(char) () at /lib/aarch64-linux-gnu/libstdc++.so.6
#7  0x0000fffff7945830 in std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&) () at /lib/aarch64-linux-gnu/libstdc++.so.6
#8  0x0000fffff7ca39ac in wvWare::OLEStreamReader::readU32() () at ./libwv2.so.4
#9  0x0000fffff7ca39cc in wvWare::OLEStreamReader::readS32() () at ./libwv2.so.4
#10 0x0000fffff7cc08e0 in wvWare::Word97::LFO::read(wvWare::OLEStreamReader*, bool) () at ./libwv2.so.4
#11 0x0000fffff7cc0854 in wvWare::Word97::LFO::LFO(wvWare::OLEStreamReader*, bool) () at ./libwv2.so.4
#12 0x0000fffff7d2be00 in wvWare::ListFormatOverride::ListFormatOverride(wvWare::OLEStreamReader*) () at ./libwv2.so.4
#13 0x0000fffff7d2d108 in wvWare::ListInfoProvider::readListFormatOverride(wvWare::OLEStreamReader*) () at ./libwv2.so.4
#14 0x0000fffff7d2c984 in wvWare::ListInfoProvider::ListInfoProvider(wvWare::OLEStreamReader*, wvWare::Word97::FIB const&, wvWare::StyleSheet const*) () at ./libwv2.so.4
#15 0x0000fffff7d0b360 in wvWare::Parser9x::init() () at ./libwv2.so.4
#16 0x0000fffff7d0a60c in wvWare::Parser9x::Parser9x(wvWare::OLEStorage*, wvWare::OLEStreamReader*, wvWare::Word97::FIB const&) () at ./libwv2.so.4
#17 0x0000fffff7d15534 in wvWare::Parser97::Parser97(wvWare::OLEStorage*, wvWare::OLEStreamReader*) () at ./libwv2.so.4
#18 0x0000fffff7d15c68 in () at ./libwv2.so.4
#19 0x0000fffff7d160ec in wvWare::ParserFactory::createParser(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () at ./libwv2.so.4
========================================================================== arm aarch64

// Adding log output to the source showed that the problem is in this function.
void ListInfoProvider::readListFormatOverride( OLEStreamReader* tableStream )
{
    // Cause: the count returned by tableStream->readU32() is a huge number, so on ARM this loop
    // ran more than 100 million times, allocating on the heap with new each time. Memory use
    // reached 16 GB and the machine froze (first seen during on-site customer testing, then
    // reproduced back at the office).
    const U32 count = tableStream->readU32();
#ifdef WV2_DEBUG_LIST_READING
    wvlog << "ListInfoProvider::readListFormatOverride(): count=" << count << std::endl;
#endif
    for ( U32 i = 0; i < count; ++i )
        m_listFormatOverride.push_back( new ListFormatOverride( tableStream ) );

    std::vector<ListFormatOverride*>::const_iterator it = m_listFormatOverride.begin();
    std::vector<ListFormatOverride*>::const_iterator end = m_listFormatOverride.end();
    for ( ; it != end; ++it ) {
        const U8 levelCount = ( *it )->countOfLevels();
        for ( int i = 0; i < levelCount; ++i ) {
            // Word seems to write 0xff pagging-bytes between LFO and LFOLVLs, also
            // between different LFOLVLs, get rid of it (Werner)
            eatLeading0xff( tableStream );
            ( *it )->appendListFormatOverrideLVL( new ListFormatOverrideLVL( tableStream ) );
        }
    }
}

ListInfoProvider::ListInfoProvider( OLEStreamReader* tableStream, const Word97::FIB& fib, const StyleSheet* styleSheet ) :
    m_listNames( 0 ), m_pap( 0 ), m_styleSheet( styleSheet ), m_currentLfoLVL( 0 ), m_currentLst( 0 ), m_version( Word8 )
{
#ifdef WV2_DEBUG_LIST_READING
    wvlog << "ListInfoProvider::ListInfoProvider() ################################" << std::endl
          << " fcPlcfLst=" << fib.fcPlcfLst << " lcbPlcfLst=" << fib.lcbPlcfLst << std::endl
          << " fcPlfLfo=" << fib.fcPlfLfo << " lcbPlfLfo=" << fib.lcbPlfLfo << std::endl
          << " fcSttbListNames=" << fib.fcSttbListNames << " lcbSttbListNames=" << fib.lcbSttbListNames << std::endl;
#endif
    tableStream->push();
    if ( fib.lcbPlcfLst != 0 ) {
        tableStream->seek( fib.fcPlcfLst, G_SEEK_SET );
        readListData( tableStream, fib.fcPlcfLst + fib.lcbPlcfLst );
    }
    if ( fib.lcbPlfLfo != 0 ) {
        if ( static_cast<U32>( tableStream->tell() ) != fib.fcPlfLfo ) {
            wvlog << "Found a \"hole\" within the table stream (list data): current="
                  << tableStream->tell() << " expected=" << fib.fcPlfLfo << std::endl;
            tableStream->seek( fib.fcPlfLfo, G_SEEK_SET );
        }
        readListFormatOverride( tableStream );
    }
    if ( fib.lcbSttbListNames != 0 ) {
        // Get rid of leading garbage. Take care, though, as the STTBF most likely starts
        // with 0xffff (extended character STTBF)
        while ( static_cast<U32>( tableStream->tell() ) < fib.fcSttbListNames &&
                tableStream->readU8() == 0xff ); // the ; is intended!
        // Check the position and warn about corrupt files
        if ( static_cast<U32>( tableStream->tell() ) != fib.fcSttbListNames ) {
            wvlog << "Found a \"hole\" within the table stream (list format override): current="
                  << tableStream->tell() << " expected=" << fib.fcSttbListNames << std::endl;
            tableStream->seek( fib.fcSttbListNames, G_SEEK_SET );
        }
        readListNames( tableStream );
    }
    tableStream->pop();
#ifdef WV2_DEBUG_LIST_READING
    wvlog << "ListInfoProvider::ListInfoProvider() done ###########################" << std::endl;
#endif
}

FIBFCLCB::FIBFCLCB(OLEStreamReader *stream, bool preservePos)
{
    clear();
    read(stream, preservePos);
}
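Both failures come down to trusting a size or count read straight from the file. If you do decide to patch the parser, one common defensive pattern is to compare such a value against the number of bytes that can still be read before looping or allocating. The sketch below is only an illustration of that idea: the helper names are hypothetical, they are not part of wv2 or doctotext, and the caller is assumed to know how many bytes remain in the stream and a lower bound on the size of one record.

```cpp
#include <cstdint>

// Hypothetical helper: true if `count` records of at least `minRecordSize`
// bytes each could still fit into the `bytesRemaining` bytes of the stream.
bool isPlausibleRecordCount(std::uint32_t count,
                            std::uint64_t bytesRemaining,
                            std::uint32_t minRecordSize) {
    if (minRecordSize == 0) {
        return false;  // cannot bound the count without a record size
    }
    return count <= bytesRemaining / minRecordSize;
}

// Hypothetical helper: true if a signed length read from the file is
// non-negative and no larger than what is left in the stream.
bool isPlausibleLength(std::int64_t length, std::uint64_t bytesRemaining) {
    return length >= 0 &&
           static_cast<std::uint64_t>(length) <= bytesRemaining;
}
```

A caller would check the count before the allocation loop and treat the file as corrupt (or unsupported) when the check fails, instead of allocating one object per claimed record.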
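Workarounds:
1. An encrypted Office file has the same file header as a normal one, so filter out encrypted Office files before handing anything to doctotext. Doing this filtering well requires knowing the file formats thoroughly. What I know is that an encrypted Office file has a fixed byte at a specific offset, so you can read the byte at that offset and check it; for example, it might be 0x51. That value is a real example, but the details differ between format versions (03 vs. 07) and subtypes, which I won't go into here. Alternatively, the file can be treated as an archive: if it contains a certain specific file, it is encrypted.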
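As a rough illustration of the "look inside the container" idea, the sketch below covers one case I am sure of: password-encrypted Office 2007+ files are stored as an OLE2 compound file that contains a stream named EncryptedPackage. The code checks the OLE2 magic bytes and then scans the raw bytes for that stream name in UTF-16LE. It is a heuristic under stated assumptions, not a real compound-file parser: it does not cover the fixed-byte check for older formats mentioned above, it may not match files encrypted by other tools, and the function names are made up for this example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Magic bytes at the start of every OLE2 compound file.
static const unsigned char kOleMagic[8] =
    {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

bool hasOleMagic(const std::vector<unsigned char>& data) {
    return data.size() >= 8 &&
           std::equal(kOleMagic, kOleMagic + 8, data.begin());
}

// OLE2 directory entries store names as UTF-16LE; encode an ASCII name that way.
std::vector<unsigned char> toUtf16Le(const std::string& ascii) {
    std::vector<unsigned char> out;
    for (char c : ascii) {
        out.push_back(static_cast<unsigned char>(c));
        out.push_back(0);
    }
    return out;
}

// Byte scan for a stream name; a proper check would walk the directory sectors.
bool containsStreamName(const std::vector<unsigned char>& data,
                        const std::string& name) {
    const std::vector<unsigned char> needle = toUtf16Le(name);
    return std::search(data.begin(), data.end(),
                       needle.begin(), needle.end()) != data.end();
}

// Heuristic: OLE2 container that mentions an "EncryptedPackage" stream.
bool looksLikeEncryptedOoxml(const std::vector<unsigned char>& data) {
    return hasOleMagic(data) && containsStreamName(data, "EncryptedPackage");
}
```

In practice you would read the file into the vector and skip doctotext when the function returns true. Because it is a plain byte scan, a document that merely contains that text could be flagged by mistake.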
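If the filter is not precise enough (because you don't know the formats well and miss some cases), an encrypted file can still slip through to the parser, and the memory blowup can freeze the machine. For that case:
2. Start a monitoring thread, and kill the doctotext process if its memory usage exceeds 1 GB.
The monitor reads the process's physical memory usage; together with the filter this gives a double safeguard.
Why do I talk about monitoring a doctotext process? In my opinion, modifying doctotext itself to fix its odd failures and error handling only makes sense if you know doctotext very well and also understand the file formats you will meet in practice.
Otherwise, I recommend building doctotext as a standalone program and invoking ./doctotext --filetype fullpath through system or popen. Then, if something goes wrong while parsing a DOC file, only that separate process crashes; your own program is unaffected, and only that one file fails to parse. Also, if you parse files from multiple threads without knowing the doctotext source well, you may run into all sorts of problems in practice.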
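A minimal sketch of the popen approach might look like this. It assumes a POSIX system; shellQuote, CommandResult, and runAndCapture are names invented for this example, and the command string is whatever doctotext command line you already use.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <sys/wait.h>

// Quote a string so a POSIX shell treats it as exactly one argument.
std::string shellQuote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') {
            out += "'\\''";  // close quote, escaped quote, reopen quote
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}

struct CommandResult {
    bool started = false;  // false if popen itself failed
    int exitCode = -1;     // set only when the child exited normally
    std::string output;    // everything the child wrote to stdout
};

// Run a shell command in a separate process and capture its stdout.
CommandResult runAndCapture(const std::string& command) {
    CommandResult result;
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) {
        return result;
    }
    result.started = true;
    char buf[4096];
    std::size_t n;
    while ((n = fread(buf, 1, sizeof(buf), pipe)) > 0) {
        result.output.append(buf, n);
    }
    const int status = pclose(pipe);
    if (status != -1 && WIFEXITED(status)) {
        result.exitCode = WEXITSTATUS(status);
    }
    return result;
}
```

When the child crashes or is killed, exitCode stays at -1, so the caller can mark that one file as failed and move on. Quoting the path matters because file names from real-world corpora often contain spaces or quote characters.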
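// Monitor the doctotext process; if it uses more than 1 GB of memory, treat it as abnormal and kill it.
// Needs <cstdio>, <cstdlib>, <exception>, <iostream>, <string>, <unistd.h>, and using namespace std.
while (1) {
    sleep(1);
    int pid = -1;
    long rssKb = 0;
    char buf[128] = "";
    // Print "PID|RSS" (RSS is in KB) for processes whose ps line contains "doctotext".
    const char* query = "ps aux | grep \"doctotext\" | grep -v \"grep\" | awk -F\" \" '{print $2 \"|\" $6}'";
    FILE* fp = popen(query, "r");
    if (fp == NULL)
        continue;  // nothing to close when popen fails
    // Only the first matching process is examined.
    char* line = fgets(buf, sizeof(buf), fp);
    pclose(fp);
    if (line == NULL)
        continue;
    string res = buf;
    string::size_type pos = res.find("|");
    if (pos == string::npos)
        continue;
    try {
        pid = stoi(res.substr(0, pos));
        rssKb = stol(res.substr(pos + 1));
    } catch (const exception&) {
        continue;  // malformed ps output
    }
    // Kill doctotext once it exceeds 1 GB.
    if (pid > 0 && rssKb / 1024 >= 1024) {
        char cmd[128] = "";
        snprintf(cmd, sizeof(cmd), "kill -9 %d", pid);
        cout << cmd << endl;
        system(cmd);
    }
}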
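If you already know the child's PID (for example because you started it yourself), you can skip the ps pipeline and read the resident set size directly. The sketch below assumes Linux, where /proc/<pid>/status contains a VmRSS line in kB; the function names are hypothetical.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Parse the "VmRSS:" line from the text of /proc/<pid>/status.
// Returns the resident set size in kB, or -1 if the line is missing or malformed.
long parseVmRssKb(const std::string& statusText) {
    std::istringstream lines(statusText);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            std::istringstream fields(line.substr(6));
            long kb = -1;
            if (fields >> kb) {
                return kb;
            }
            return -1;
        }
    }
    return -1;
}

// Read and parse /proc/<pid>/status; returns -1 if the process is gone.
long readVmRssKb(int pid) {
    std::ifstream in("/proc/" + std::to_string(pid) + "/status");
    if (!in) {
        return -1;
    }
    std::stringstream contents;
    contents << in.rdbuf();
    return parseVmRssKb(contents.str());
}

// True when a valid reading is at or above the limit (both in kB).
bool exceedsLimit(long rssKb, long limitKb) {
    return rssKb >= 0 && rssKb >= limitKb;
}
```

With a 1 GB limit the check becomes exceedsLimit(readVmRssKb(pid), 1024L * 1024L), which avoids matching unrelated processes whose command line happens to contain "doctotext".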
If you are still worried, you can also kill any doctotext process that takes too long to parse.
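One way to get both a time limit and a memory limit without a separate monitoring loop is to start the child with fork and execvp, cap its address space with setrlimit, and kill it once a deadline passes. This is a different technique from the ps-based monitor, offered only as a sketch: RunResult and runWithLimits are hypothetical names, and RLIMIT_AS limits virtual address space rather than the physical memory (RSS) that the monitor checks, so the two limits are not equivalent.

```cpp
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

struct RunResult {
    bool timedOut = false;  // true if the child was killed at the deadline
    int exitCode = -1;      // set when the child exited normally
    int termSignal = 0;     // set when the child was terminated by a signal
};

// Run argv[0] with the given arguments, an address-space cap, and a timeout.
RunResult runWithLimits(const std::vector<std::string>& argv,
                        int timeoutSeconds, rlim_t maxAddressSpaceBytes) {
    RunResult result;
    if (argv.empty()) return result;

    // Build the argument array before fork so the child does not allocate.
    std::vector<char*> args;
    for (const std::string& a : argv) args.push_back(const_cast<char*>(a.c_str()));
    args.push_back(nullptr);

    const pid_t pid = fork();
    if (pid < 0) return result;
    if (pid == 0) {
        struct rlimit lim;
        lim.rlim_cur = maxAddressSpaceBytes;
        lim.rlim_max = maxAddressSpaceBytes;
        setrlimit(RLIMIT_AS, &lim);  // allocations beyond the cap fail in the child
        execvp(args[0], args.data());
        _exit(127);                  // exec failed
    }

    const auto deadline = std::chrono::steady_clock::now() +
                          std::chrono::seconds(timeoutSeconds);
    int status = 0;
    for (;;) {
        const pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid) break;
        if (done < 0) return result;  // waitpid failed
        if (std::chrono::steady_clock::now() >= deadline) {
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);  // reap the killed child
            result.timedOut = true;
            break;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    if (WIFEXITED(status)) result.exitCode = WEXITSTATUS(status);
    if (WIFSIGNALED(status)) result.termSignal = WTERMSIG(status);
    return result;
}
```

With the address-space cap in place, a runaway allocation loop in the child fails inside that process rather than exhausting the machine's memory, and the parent only sees a failed or killed child for that one file.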
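To sum up: doctotext can extract content from many file types, but it is not fully polished; RTF support, for example, is imperfect. If you are confident you can handle the details in your own modifications, link it into your project as a library. If not, invoke it as a separate process, which saves a lot of unnecessary trouble. You can also add other file types to doctotext yourself, such as OFD. With that, parsing is covered for the vast majority of file types, and the results are quite good.
Also: to determine a file's real type, you can use magic_open, magic_load, magic_file, magic_close, and related functions from libmagic.so.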