
A Concise Tesseract-OCR Installation Guide


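Introduction: Tesseract is one of the best-known open-source projects in OCR. It recognizes the text in an image and converts it to plain text. This article briefly covers how to install it and try it out on sample images.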

1.  Tesseract-OCR

   The tesseract project has moved entirely to GitHub, which is now the place to find all the main information.

   URL: https://github.com/tesseract-ocr/tesseract

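2.  Installing Tesseract-OCR

  On Windows, installation is simple: just run the installer. This section focuses on CentOS. One note: if you choose to install additional language data, expect a noticeable wait while it downloads.

  Operating system: CentOS 7, JDK 8

  step 1:  yum search tesseract-ocr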

  [root@flybird ~]# yum search tesseract-ocr
  Loaded plugins: langpacks
  ========================= Matched: tesseract-ocr =========================
  tesseract.x86_64 : Raw OCR Engine
  tesseract-devel.x86_64 : Development files for tesseract
  tesseract-langpack-afr.noarch : Afrikaans language data for tesseract
  tesseract-langpack-amh.noarch : Amharic language data for tesseract
  tesseract-langpack-ara.noarch : Arabic language data for tesseract
  tesseract-langpack-asm.noarch : Assamese language data for tesseract
  tesseract-langpack-aze.noarch : Azerbaijani language data for tesseract
  tesseract-langpack-aze_cyrl.noarch : "Azerbaijani language data for tesseract
  tesseract-langpack-bel.noarch : Belarusian language data for tesseract
  tesseract-langpack-ben.noarch : Bengali language data for tesseract
  tesseract-langpack-bod.noarch : "Tibetan language data for tesseract
  tesseract-langpack-bos.noarch : Bosnian language data for tesseract
  tesseract-langpack-bul.noarch : Bulgarian language data for tesseract
  tesseract-langpack-cat.noarch : Catalan language data for tesseract
  tesseract-langpack-ceb.noarch : Cebuano language data for tesseract
  ............

  step 2:  yum install tesseract.x86_64

  [root@flybird ~]# yum install tesseract.x86_64
  Loaded plugins: langpacks
  Resolving Dependencies
  --> Running transaction check
  ---> Package tesseract.x86_64 0:3.04.00-3.el7 will be installed
  --> Processing Dependency: liblept.so.4()(64bit) for package: tesseract-3.04.00-3.el7.x86_64
  --> Processing Dependency: libicuuc.so.50()(64bit) for package: tesseract-3.04.00-3.el7.x86_64
  --> Processing Dependency: libicui18n.so.50()(64bit) for package: tesseract-3.04.00-3.el7.x86_64
  --> Running transaction check
  ---> Package leptonica.x86_64 0:1.72-2.el7 will be installed
  ---> Package libicu.x86_64 0:50.1.2-15.el7 will be installed
  --> Finished Dependency Resolution
  Dependencies Resolved
  ================================================================================
   Package          Arch          Version              Repository          Size
  ================================================================================
  Installing:
   tesseract        x86_64        3.04.00-3.el7        epel                11 M
  Installing for dependencies:
   leptonica        x86_64        1.72-2.el7           epel               928 k
   libicu           x86_64        50.1.2-15.el7        base               6.9 M
  Transaction Summary
  ================================================================================
  Install  1 Package (+2 Dependent packages)
  Total download size: 19 M
  Installed size: 67 M
  Is this ok [y/d/N]: y
  Downloading packages:
  (1/3): leptonica-1.72-2.el7.x86_64.rpm              | 928 kB  00:00:00
  (2/3): libicu-50.1.2-15.el7.x86_64.rpm              | 6.9 MB  00:00:07
  (3/3): tesseract-3.04.00-3.el7.x86_64.rpm           |  11 MB  00:00:11
  --------------------------------------------------------------------------------
  Total                                      1.7 MB/s |  19 MB  00:00:11
  Running transaction check
  Running transaction test
  Transaction test succeeded
  Running transaction
    Installing : leptonica-1.72-2.el7.x86_64          1/3
    Installing : libicu-50.1.2-15.el7.x86_64          2/3
    Installing : tesseract-3.04.00-3.el7.x86_64       3/3
    Verifying  : tesseract-3.04.00-3.el7.x86_64       1/3
    Verifying  : libicu-50.1.2-15.el7.x86_64          2/3
    Verifying  : leptonica-1.72-2.el7.x86_64          3/3
  Installed:
    tesseract.x86_64 0:3.04.00-3.el7
  Dependency Installed:
    leptonica.x86_64 0:1.72-2.el7    libicu.x86_64 0:50.1.2-15.el7
  Complete!

  step 3:  install the devel package (and the OSD data)

  [root@flybird ~]# yum install tesseract-devel.x86_64 tesseract-osd.x86_64
  Loaded plugins: langpacks
  Resolving Dependencies
  --> Running transaction check
  ---> Package tesseract-devel.x86_64 0:3.04.00-3.el7 will be installed
  --> Processing Dependency: pkgconfig(lept) for package: tesseract-devel-3.04.00-3.el7.x86_64
  --> Running transaction check
  ---> Package leptonica-devel.x86_64 0:1.72-2.el7 will be installed
  --> Finished Dependency Resolution
  Dependencies Resolved
  ================================================================================
   Package               Arch         Version             Repository       Size
  ================================================================================
  Installing:
   tesseract-devel       x86_64       3.04.00-3.el7       epel             80 k
  Installing for dependencies:
   leptonica-devel       x86_64       1.72-2.el7          epel            108 k
  Transaction Summary
  ================================================================================
  Install  1 Package (+1 Dependent package)
  Total download size: 188 k
  Installed size: 1.1 M
  Is this ok [y/d/N]: y
  Downloading packages:
  (1/2): tesseract-devel-3.04.00-3.el7.x86_64.rpm     |  80 kB  00:00:00
  (2/2): leptonica-devel-1.72-2.el7.x86_64.rpm        | 108 kB  00:00:00
  --------------------------------------------------------------------------------
  Total                                      738 kB/s | 188 kB  00:00:00
  Running transaction check
  Running transaction test
  Transaction test succeeded
  Running transaction
    Installing : leptonica-devel-1.72-2.el7.x86_64        1/2
    Installing : tesseract-devel-3.04.00-3.el7.x86_64     2/2
    Verifying  : leptonica-devel-1.72-2.el7.x86_64        1/2
    Verifying  : tesseract-devel-3.04.00-3.el7.x86_64     2/2
  Installed:
    tesseract-devel.x86_64 0:3.04.00-3.el7
  Dependency Installed:
    leptonica-devel.x86_64 0:1.72-2.el7
  Complete!

  step 4:  install the language packs tesseract-langpack-chi_sim.noarch and tesseract-langpack-chi_tra.noarch (the transcript shows chi_sim; chi_tra is installed the same way)

  [root@flybird ~]# yum install tesseract-langpack-chi_sim.noarch
  Loaded plugins: langpacks
  Resolving Dependencies
  --> Running transaction check
  ---> Package tesseract-langpack-chi_sim.noarch 0:3.04.00-3.el7 will be installed
  --> Finished Dependency Resolution
  Dependencies Resolved
  ================================================================================
   Package                        Arch       Version           Repository   Size
  ================================================================================
  Installing:
   tesseract-langpack-chi_sim     noarch     3.04.00-3.el7     epel         15 M
  Transaction Summary
  ================================================================================
  Install  1 Package
  Total download size: 15 M
  Installed size: 40 M
  Is this ok [y/d/N]: y
  Downloading packages:
  tesseract-langpack-chi_sim-3.04.00-3.el7.noarch.rpm |  15 MB  00:00:15
  Running transaction check
  Running transaction test
  Transaction test succeeded
  Running transaction
    Installing : tesseract-langpack-chi_sim-3.04.00-3.el7.noarch     1/1
    Verifying  : tesseract-langpack-chi_sim-3.04.00-3.el7.noarch     1/1
  Installed:
    tesseract-langpack-chi_sim.noarch 0:3.04.00-3.el7
  Complete!
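
  The installation steps above could also be collapsed into a single non-interactive command. The following is only a sketch under a few assumptions: the EPEL repository is already enabled (the transcripts show tesseract coming from the epel repository), the caller is root, and the function name install_tesseract is made up for illustration. The package names are the ones that appear in the commands above.

```shell
# Sketch: install the engine, the development files, the OSD data, and both
# Chinese language packs in one yum transaction.
# Assumptions: EPEL is already enabled and this runs as root.
# install_tesseract is a hypothetical helper name, not part of any package.
install_tesseract() {
  # -y answers the "Is this ok [y/d/N]" prompt automatically.
  yum install -y \
    tesseract \
    tesseract-devel \
    tesseract-osd \
    tesseract-langpack-chi_sim \
    tesseract-langpack-chi_tra
}
```

  Calling install_tesseract once replaces steps 2 to 4; step 1 (the search) is only needed to discover package names.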

3.  Using Tesseract-OCR

 a.  Recognizing text in an image

   Command format:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
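    Example: tesseract ttest.png out -l lang-type
    Here lang-type stands for a language code such as eng or chi_sim, and the recognized text is written to out.txt.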
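
    The command format can be wrapped in a small shell function so that the language and page segmentation mode become optional arguments. This is a minimal sketch, not an official interface: ocr_image and its argument order are invented for illustration, and it uses the single-dash -psm flag from the 3.04 usage line above (newer tesseract releases spell it --psm).

```shell
# ocr_image IMAGE OUTPUTBASE [LANG] [PSM]
# Hypothetical wrapper around the tesseract command line shown above.
# LANG defaults to eng; PSM is passed only when given.
ocr_image() {
  local image="$1"
  local outbase="$2"
  local lang="${3:-eng}"
  local psm="${4:-}"
  if [ -n "$psm" ]; then
    # -psm is the 3.04 spelling; adjust to --psm on newer versions.
    tesseract "$image" "$outbase" -l "$lang" -psm "$psm"
  else
    tesseract "$image" "$outbase" -l "$lang"
  fi
}
```

    For example, ocr_image ttest.png out chi_sim runs the same command as the example above with lang-type set to chi_sim.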

   We picked two test images, one Chinese and one English, to see how well the OCR performs.

 b.  Checking which languages tesseract supports

  [root@flybird practice]# tesseract --list-langs
  List of available languages (4):
  eng
  osd
  chi_tra
  chi_sim
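 The output lists four entries: three languages (eng, chi_tra, chi_sim) plus osd, which is not a language but the data used for orientation and script detection.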
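
 Before running OCR in a script, it can help to confirm that the language data is actually installed. The sketch below checks the --list-langs output for an exact match. The helper name has_lang is hypothetical, and the 2>&1 redirection is there on the assumption that some 3.x builds print the list on stderr rather than stdout.

```shell
# has_lang LANG: succeed if LANG is listed by `tesseract --list-langs`.
# has_lang is a hypothetical helper name used only for this illustration.
has_lang() {
  # 2>&1 merges stderr into stdout (assumption: the list may arrive on either).
  # grep -F treats LANG as a literal string; -x requires a whole-line match,
  # so the header line and partial names such as "chi" do not match.
  tesseract --list-langs 2>&1 | grep -Fx -- "$1" >/dev/null
}
```

 A script could then guard the OCR call, for example: has_lang chi_sim || echo "chi_sim language data is not installed".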

 c.  OCR on Chinese text

    Source image: chin-ocr.png

     Command: tesseract chin-ocr.png chin-out -l chi_sim

     Result:

  [root@flybird practice]# tesseract chin-ocr.png chin-out -l chi_sim
  Tesseract Open Source OCR Engine v3.04.00 with Leptonica
  [root@flybird practice]# cat chin-out.txt
  11月17日痿言 ′ 文童发文透露租妻子马伊蜊合作的新剧 (剃刀边缘) 快要刮作完
  成) 感慨良多′他自称 ″过街者冒″ 租 ″笨人″ ′直言自己虽然忍不任茌片场发脾气′
  但 ″i人亘″ 二字是心安理才寻她受了′
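  As you can see, accuracy leaves room for improvement: many characters were misrecognized. Note that the source image has a watermark in the background, which interfered with recognition.

 d.  OCR on English text

    Source image: english-ocr.png

     Command: tesseract english-ocr.png eng-ocr -l eng

     Result: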

  [root@flybird practice]# tesseract english-ocr.png eng-ocr -l eng
  Tesseract Open Source OCR Engine v3.04.00 with Leptonica
  [root@flybird practice]# cat eng-ocr.txt
  I have lived in China for a long time and we all like it very much. We do have it done.
  It is very funny in a good lucky state.
   This time the result is very good, though of course the source image had very little interference.
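
   When there are several images to process, the same command can be run in a loop. The following is a sketch only: ocr_dir is a hypothetical helper, it handles .png files only, and it writes each result next to its image by using the image path without the extension as the output base.

```shell
# ocr_dir DIR [LANG]: run tesseract on every .png file in DIR.
# Hypothetical helper for illustration; LANG defaults to eng.
ocr_dir() {
  local dir="$1"
  local lang="${2:-eng}"
  local img
  for img in "$dir"/*.png; do
    # When nothing matches, the glob stays literal; skip that case.
    [ -e "$img" ] || continue
    # ${img%.png} strips the extension, so a.png produces a.txt.
    tesseract "$img" "${img%.png}" -l "$lang"
  done
}
```

   For instance, ocr_dir . chi_sim would process chin-ocr.png in the current directory and write chin-ocr.txt.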

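4.  Summary

    This article only gave a brief walkthrough of installation and basic usage. For anything further, consult the tesseract project itself and experiment on your own.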

