
Extracting Text from PDFs with CNOCR and PaddleOCR: Personal Notes

Contents

I. PyMuPDF

II. CNOCR

III. PaddleOCR

IV. Tesseract (not tested)

I. PyMuPDF

1. Install PyMuPDF

pip install pymupdf

2. Example: PDF to txt

import os
import datetime
import fitz  # PyMuPDF is imported under the name fitz (pip install pymupdf)

def pyMuPDF_fitz(pdfPath):
    startTime_pdf2txt = datetime.datetime.now()  # start time
    # Collect the text layer of every page
    text_list = []
    pdfDoc = fitz.open(pdfPath)
    for page in pdfDoc:
        text = page.get_text()
        text_list.append(text)
    pdfDoc.close()
    text_all = "\n".join(text_list)
    # Append everything to one txt file; UTF-8 is set explicitly so the result
    # does not depend on the system locale
    try:
        with open("/home/bingxing2/ailab/group/ai4agr/wzf/LLM/txt/test.txt", 'a+', encoding='utf-8') as neirong:
            neirong.write(text_all)
    except IOError as e:
        print("An error occurred while writing the file:", e)
    endTime_pdf2txt = datetime.datetime.now()  # end time
    print('pdf2txt time (s) =', (endTime_pdf2txt - startTime_pdf2txt).total_seconds())

def process_all_pdfs_in_directory(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            pdf_path = os.path.join(directory, filename)
            pyMuPDF_fitz(pdf_path)

if __name__ == "__main__":
    # Directory that contains the PDFs
    pdf_directory = r'/home/bingxing2/ailab/group/ai4agr/wzf/LLM/pdf/'
    process_all_pdfs_in_directory(pdf_directory)
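A side note on the timing code in these scripts: `timedelta.seconds` is only the seconds component of the difference (it drops fractions and ignores whole days), while `total_seconds()` gives the full elapsed time. A tiny sketch with made-up timestamps:

```python
import datetime

# Made-up timestamps, one day and 2.75 seconds apart
start = datetime.datetime(2024, 5, 9, 12, 0, 0)
end = start + datetime.timedelta(days=1, seconds=2, microseconds=750000)
elapsed = end - start

print(elapsed.seconds)          # seconds component only: 2
print(elapsed.total_seconds())  # full duration: 86402.75
```

For pages that take well under a second, `.seconds` simply prints 0, which is why `total_seconds()` is usually the better choice for this kind of logging.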

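Notes:

pymupdf did not extract tables directly in this setup; pdfplumber can be used for that. (Recent PyMuPDF releases add page.find_tables(), which may be worth trying.)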
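A minimal sketch of table extraction with pdfplumber, under the assumption that `pip install pdfplumber` has been run. The helper names `table_to_tsv` and `extract_tables` are made up for illustration; `pdfplumber.open`, `pdf.pages` and `page.extract_tables()` are the library calls being relied on.

```python
def table_to_tsv(table):
    """Turn one extracted table (rows of cells, cells may be None) into TSV text."""
    lines = []
    for row in table:
        # Empty cells come back as None; newlines inside a cell would break the row
        cells = ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        lines.append("\t".join(cells))
    return "\n".join(lines)


def extract_tables(pdf_path):
    """Return every table found in the PDF as a TSV string."""
    import pdfplumber  # imported lazily so table_to_tsv works without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(table_to_tsv(table))
    return tables
```

Like PyMuPDF's text extraction, this only sees tables that exist in the PDF's text layer; tables inside scanned images still need OCR.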

To extract embedded images, use img = page.get_images() (older PyMuPDF releases used the camelCase name page.getImageList()).
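A rough sketch of saving the embedded images. In current PyMuPDF the listing call is `page.get_images()` (older releases spelled it `getImageList()`), and `doc.extract_image(xref)` returns the raw bytes plus a file extension. The function names and the output naming scheme below are hypothetical.

```python
import os


def image_filename(pdf_path, page_number, index, ext):
    """Build a name such as 'book_p3_img1.png' (naming scheme is hypothetical)."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"{stem}_p{page_number}_img{index}.{ext}"


def extract_embedded_images(pdf_path, output_folder):
    """Save every image embedded in the PDF and return the written paths."""
    import fitz  # PyMuPDF, imported lazily

    os.makedirs(output_folder, exist_ok=True)
    written = []
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for index, img in enumerate(page.get_images(full=True), start=1):
            xref = img[0]  # first item of each entry is the image's xref
            info = doc.extract_image(xref)  # dict with raw bytes and extension
            name = image_filename(pdf_path, page_number, index, info["ext"])
            path = os.path.join(output_folder, name)
            with open(path, "wb") as f:
                f.write(info["image"])
            written.append(path)
    doc.close()
    return written
```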

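After extraction, the ordinary text came out fine but the numbers did not.

Cause: the numbers are stored in the PDF as images rather than as text. Extracting them therefore requires OCR (optical character recognition).

So the approach is to convert the PDF to images first, then extract text from the images (with cnocr, paddleocr, or tesseract).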
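One way to decide which pages actually need OCR is a simple heuristic on the extracted text layer. This is only a sketch: the function names and the 20-character threshold are illustrative guesses, and a page that legitimately contains no digits would be flagged too.

```python
import re


def needs_ocr(page_text, min_chars=20):
    """Flag a page when its text layer is nearly empty or has no digits at all."""
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        return True  # almost no text layer: probably a scanned page
    return re.search(r"\d", stripped) is None  # text present, but no digits


def pages_needing_ocr(pdf_path):
    """Return the 1-based page numbers that the heuristic flags."""
    import fitz  # PyMuPDF, imported lazily

    doc = fitz.open(pdf_path)
    flagged = [i for i, page in enumerate(doc, start=1)
               if needs_ocr(page.get_text())]
    doc.close()
    return flagged
```

Running OCR only on the flagged pages can save a lot of time on large books, at the cost of trusting the heuristic.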

Converting the PDF to images:

import os
import fitz  # PyMuPDF is imported under the name fitz (pip install pymupdf)

def pdf_to_images(directory, filename, output_folder):
    pdf_path = os.path.join(directory, filename)
    base_name = os.path.splitext(filename)[0]  # file name without ".pdf"
    pdf_doc = fitz.open(pdf_path)
    for page_number in range(len(pdf_doc)):
        page = pdf_doc[page_number]
        # Render at 4x zoom in both directions, without an alpha channel
        image = page.get_pixmap(matrix=fitz.Matrix(4, 4), alpha=False)
        image_path = os.path.join(output_folder, f"{base_name}_page_{page_number + 1}.png")
        image.save(image_path)
    pdf_doc.close()

def process_all_pdfs_in_directory(directory, output_folder):
    # Convert every PDF in the directory to images
    os.makedirs(output_folder, exist_ok=True)  # make sure the output folder exists
    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            pdf_to_images(directory, filename, output_folder)

if __name__ == "__main__":
    # Directory that contains the PDFs
    pdf_directory = r'/home/bingxing2/ailab/group/ai4agr/wzf/LLM/pdf/books/'
    # Directory the images are written to
    output_folder = r'/home/bingxing2/ailab/group/ai4agr/wzf/LLM/images/books/'
    process_all_pdfs_in_directory(pdf_directory, output_folder)
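The `fitz.Matrix(4, 4)` in the script is a zoom factor. PDF coordinates are in points (72 per inch), so a zoom of 4 corresponds to roughly 288 dpi. A small sketch of that arithmetic, with helper names made up for illustration:

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_for_dpi(dpi):
    """Zoom factor to pass to fitz.Matrix(zoom, zoom) for a target resolution."""
    return dpi / PDF_POINTS_PER_INCH


def rendered_size(width_pt, height_pt, zoom):
    """Approximate pixel size of a page of the given size in points."""
    return round(width_pt * zoom), round(height_pt * zoom)


print(zoom_for_dpi(288))             # 4.0
print(rendered_size(595, 842, 4))    # roughly A4 at zoom 4: (2380, 3368)
```

Higher zoom generally helps OCR on small print, but image size and OCR time grow with the square of the zoom factor. Newer PyMuPDF versions also accept a `dpi` argument to `get_pixmap`, which should give an equivalent result.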

II. CNOCR

1. Install cnocr

pip install cnocr

2. Convert images to text and store everything in one txt file

import cnocr
import os
import datetime

def recognize_text(txt_directory, image_directory):
    # Initialize cnocr
    ocr = cnocr.CnOcr()
    texts = []  # recognized text of every image, in processing order
    # os.listdir() returns names in arbitrary order; sort for reproducible output
    for filename in sorted(os.listdir(image_directory)):
        if filename.endswith('.png'):
            startTime_img2txt = datetime.datetime.now()  # start time
            image_path = os.path.join(image_directory, filename)
            # Read the image and recognize its text
            results = ocr.ocr(image_path)
            text = ''.join([result['text'].replace('\n', '') for result in results])
            texts.append(text)
            # Write each image's text as soon as it has been recognized
            with open(txt_directory, 'a+', encoding='utf-8') as f:
                f.write(text + '\n')
            endTime_img2txt = datetime.datetime.now()  # end time
            print('img2txt time (s) =', (endTime_img2txt - startTime_img2txt).total_seconds(), ",", filename, "written")
    return texts

if __name__ == "__main__":
    # Directory that contains the images
    image_directory = '/home/bingxing2/ailab/group/ai4agr/wzf/LLM/images/books'
    # Path of the output txt file
    txt_directory = "/home/bingxing2/ailab/group/ai4agr/wzf/LLM/txt/test.txt"
    # Recognize the text
    recognize_text(txt_directory, image_directory)
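One thing to watch when looping over the image folder: `os.listdir()` returns names in arbitrary order, and even a plain `sorted()` puts `page_10` before `page_2`. If the page order in the txt file matters, a number-aware sort key helps. A small sketch (the `natural_key` helper is made up for illustration):

```python
import re


def natural_key(filename):
    """Split a name into text and number chunks so page_2 sorts before page_10."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", filename)]


names = ["book_page_10.png", "book_page_2.png", "book_page_1.png"]
print(sorted(names))                   # lexicographic: 1, 10, 2
print(sorted(names, key=natural_key))  # number-aware: 1, 2, 10
```

In the script this would mean iterating over `sorted(os.listdir(image_directory), key=natural_key)`.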

III. PaddleOCR

Steps:

1. Install PaddleOCR

2. Prepare the PDF files

3. Convert the PDFs to images, then extract text from the images

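Installation:

1. Install PaddleOCR

pip install "paddleocr>=2.0.1"

2. Install paddlepaddle (the CPU version is installed by default; the GPU version does not currently seem to support the arm64 architecture). See the installation guide in the official PaddlePaddle documentation.

For the GPU version, see the "Getting Started" page on the official PaddlePaddle (飞桨) website.

pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple

Verify that paddlepaddle installed correctly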

Start the Python interpreter and run the built-in check:

python
import paddle
paddle.utils.run_check()

3. Convert images to text and store everything in one txt file

import paddleocr
import os
import datetime
import fitz

def recognize_text(txt_directory, image_directory, pdf_directory):
    # Initialize PaddleOCR
    ocr = paddleocr.PaddleOCR(use_angle_cls=True, lang='ch')
    for filename in os.listdir(pdf_directory):
        if filename.endswith('.pdf'):
            pdf_path = os.path.join(pdf_directory, filename)
            # The PDF is opened only to get its page count;
            # the page images were rendered beforehand
            pdf_doc = fitz.open(pdf_path)
            page_count = len(pdf_doc)
            pdf_doc.close()
            for page_number in range(page_count):
                image_name = f"{filename[:-4]}_page_{page_number + 1}.png"
                image_path = os.path.join(image_directory, image_name)
                startTime_img2txt = datetime.datetime.now()  # start time
                # Read the image and recognize its text
                results = ocr.ocr(image_path, cls=True)
                # results[0] may be None when no text is detected on a page
                lines = results[0] if results and results[0] else []
                text = ''.join([result[1][0] for result in lines])
                # Append the recognized text to the txt file
                with open(txt_directory, 'a+', encoding='utf-8') as f:
                    f.write(text + '\n')
                endTime_img2txt = datetime.datetime.now()  # end time
                print('img2txt time (s) =', (endTime_img2txt - startTime_img2txt).total_seconds(), ",", image_name, "written")

if __name__ == "__main__":
    # Directory that contains the images
    image_directory = '/home/bingxing2/ailab/group/ai4agr/wzf/LLM/images/books'
    # Path of the output txt file
    txt_directory = "/home/bingxing2/ailab/group/ai4agr/wzf/LLM/txt/testpaddlepaddleocr.txt"
    # Directory that contains the PDFs
    pdf_directory = '/home/bingxing2/ailab/group/ai4agr/wzf/LLM/pdf/books/'
    # Recognize the text
    recognize_text(txt_directory, image_directory, pdf_directory)
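The script indexes into the result as `result[1][0]`, which reflects the nested layout assumed here: `results[0]` is a list of `[box, (text, score)]` entries for the image. A sketch of a helper that also uses the confidence score and tolerates an empty page. The helper name, the 0.5 threshold and the sample data are made up, and other PaddleOCR releases may return a different structure.

```python
def lines_from_paddle_result(results, min_score=0.5):
    """Join the recognized lines of one image, skipping low-confidence ones.

    Assumes results[0] is a list of [box, (text, score)] entries,
    or None/empty when nothing was detected.
    """
    page = results[0] if results else None
    if not page:
        return ""
    return "\n".join(text for _box, (text, score) in page if score >= min_score)


# Hand-written sample in the assumed layout (not real OCR output)
sample = [[
    [[[0, 0], [10, 0], [10, 5], [0, 5]], ("Chapter 1", 0.98)],
    [[[0, 6], [10, 6], [10, 9], [0, 9]], ("~~~", 0.31)],
    [[[0, 10], [10, 10], [10, 15], [0, 15]], ("Yield: 520", 0.90)],
]]
print(lines_from_paddle_result(sample))
```

Keeping one recognized line per output line, instead of joining everything with `''`, also makes it easier to compare the OCR text against the page later.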

4. Error:

Script:

from paddleocr import PaddleOCR
import re
ocr = PaddleOCR(lang="ch")  # use Chinese recognition
result = ocr.ocr("/home/bingxing2/ailab/group/ai4agr/wzf/LLM/images/page_1.png")
for line in result:
    print(line)  # print the recognition result; the run failed with the error below

Terminal output:

(myenv) [scxlab0069@paraai-n32-h-01-ccs-master-1 wzf]$ python /home/bingxing2/ailab/group/ai4agr/wzf/LLM/ocr/paddleocr/img_to_txt.py
download https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar to /home/bingxing2/ailab/scxlab0069/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer/ch_PP-OCRv4_det_infer.tar
100%|███████████████████████████████████████████████████████████████████████| 4.89M/4.89M [00:00<00:00, 13.9MiB/s]
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 inflateReset2
----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
[TimeInfo: *** Aborted at 1715233076 (unix time) try "date -d @1715233076" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x4ad7b62366b7a28) received by PID 4127064 (TID 0x40000b615370) from PID 913013288 ***]
Segmentation fault

Fix: paddlepaddle 2.6 is too new; reinstalling paddlepaddle 2.5.2 resolved it. See "Error under the CPU version: `Segmentation fault` is detected by the operating system", Issue #12075, PaddlePaddle/PaddleOCR on GitHub.
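To check which paddlepaddle version is installed before running OCR, something like the following could be used. It is only a sketch: the helper names are made up, the package name `paddlepaddle` is the CPU package from the pip command in these notes (a GPU install would be `paddlepaddle-gpu`), and treating everything from 2.6 upward as affected is an assumption, since only 2.6 and 2.5.2 were tried here.

```python
import re
from importlib import metadata


def version_tuple(version):
    """'2.5.2' -> (2, 5, 2); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def is_affected(version):
    """True for 2.6 and newer (assumption: only 2.6 was observed to crash)."""
    return version_tuple(version) >= (2, 6)


def installed_paddle_version():
    """Version string of the installed CPU package, or None if not installed."""
    try:
        return metadata.version("paddlepaddle")
    except metadata.PackageNotFoundError:
        return None


version = installed_paddle_version()
if version is not None and is_affected(version):
    print("paddlepaddle", version, "found; consider: pip install paddlepaddle==2.5.2")
```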

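IV. Tesseract (not tested)

1. Install the dependencies

yum install autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel zlib-devel make

This failed because of insufficient permissions. Trying pip instead also failed. Reason: libjpeg-devel, libpng-devel, libtiff-devel and zlib-devel are normally provided by the system package manager (such as yum), not by the Python package manager (pip). They are development libraries needed at build time, not Python packages.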

2. Install the Leptonica library that Tesseract depends on

wget https://github.com/DanBloomberg/leptonica/releases/download/1.80.0/leptonica-1.80.0.tar.gz
tar -xzvf leptonica-1.80.0.tar.gz
cd leptonica-1.80.0
./configure --prefix=/home/tess4j/leptonica-1.80.0 && make && make install

3. Add Leptonica to the environment variables

vim /etc/profile
# add the following lines
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/tess4j/leptonica-1.80.0/lib
export LIBLEPT_HEADERSDIR=/home/tess4j/leptonica-1.80.0/include
export PKG_CONFIG_PATH=/home/tess4j/leptonica-1.80.0/lib/pkgconfig

Exit the editor, then apply the configuration:

source /etc/profile

4. Install Tesseract-OCR

wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/4.1.1.tar.gz
# rename the archive
mv 4.1.1.tar.gz tesseract-4.1.1.tar.gz
tar -xzvf tesseract-4.1.1.tar.gz
cd tesseract-4.1.1/
./autogen.sh
./configure --prefix=/home/tess4j/tesseract-4.1.1 && make && make install
sudo ldconfig

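5. Configure the Tesseract environment variables

vim /etc/profile
# add the following lines
PATH=$PATH:/home/tess4j/tesseract-4.1.1/bin
export PATH
export TESSDATA_PREFIX=/home/temp/tessData  # note: this is the directory that holds the trained data files
export PATH=$PATH:$TESSDATA_PREFIX
# save and exit, then apply the configuration
source /etc/profile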
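Since `TESSDATA_PREFIX` has to point at the directory holding the `.traineddata` files, a quick check that the needed language files are really there can save a confusing error later. A sketch, with a made-up helper name; the language list matches the `chi_sim+eng` used in the test command of these notes:

```python
import os


def missing_traineddata(tessdata_dir, languages=("chi_sim", "eng")):
    """Return the languages whose <lang>.traineddata file is not in tessdata_dir."""
    return [lang for lang in languages
            if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))]


def check_tessdata_env():
    """Report on the directory named by TESSDATA_PREFIX, if it is set."""
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix is None:
        return "TESSDATA_PREFIX is not set"
    missing = missing_traineddata(prefix)
    return "ok" if not missing else "missing: " + ", ".join(missing)
```

Calling `check_tessdata_env()` in a fresh shell after `source /etc/profile` shows whether the variable took effect.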

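6. Check that the installation succeeded

tesseract --version

7. Test

Command to recognize an image:
tesseract 567.png outputteee -l chi_sim+eng
Parameters:
tesseract = the command
567.png = an image file in the current directory
outputteee = output base name; outputteee.txt is created in the current directory
-l chi_sim+eng = Simplified Chinese plus English; for a single language, -l chi_sim is enough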
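To run the same command over many page images from Python, the command line can be assembled and passed to `subprocess`. This is an untested sketch in the same spirit as the rest of this section; the function names are made up, and it assumes the `tesseract` binary is on `PATH`.

```python
import subprocess


def tesseract_command(image_path, output_base, languages=("chi_sim", "eng")):
    """Build the argument list for: tesseract <image> <output_base> -l a+b"""
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_tesseract(image_path, output_base, languages=("chi_sim", "eng")):
    """Run tesseract and return the path of the txt file it writes."""
    subprocess.run(tesseract_command(image_path, output_base, languages), check=True)
    return output_base + ".txt"


print(tesseract_command("567.png", "outputteee"))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.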

Reference for this section:

"The most complete guide to installing Tesseract on Linux" (CSDN blog)


 
