Recognizing text in an image
First, put the downloaded tessdata files in your project's bin\Debug\tessdata folder.
The tessdata files can be downloaded from: https://github.com/tesseract-ocr/tessdata
Namespaces:
- using System.Drawing;
- using System.IO;
- using System.Reflection;
- using Tesseract;
Required NuGet package: Tesseract
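Initialize the TesseractEngine (the commented-out lines set a whitelist of characters that may be recognized and a blacklist of characters that are never recognized):
- private TesseractEngine tesseractEngine;
- // Look for tessdata next to the executing assembly, e.g. bin\Debug\tessdata (Assembly is in System.Reflection)
- string baseDirectory = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
- string datapath = Path.Combine(baseDirectory, "tessdata");
- tesseractEngine = new TesseractEngine(datapath, "eng", EngineMode.Default);
-
- //tesseractEngine.SetVariable("tessedit_char_whitelist", "0123456789");
- //tesseractEngine.SetVariable("tessedit_char_blacklist", "!?@#$%&*()<>_-+=/:;'\"");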
Getting the text
confidence is the mean recognition confidence that the engine reports for the page.
//Bitmap bitmap = new Bitmap(fileName);
- public string GetText(Bitmap bitmap, out float confidence)
- {
- // The using block disposes the page even if GetText throws
- using (var page = tesseractEngine.Process(bitmap))
- {
- var text = page.GetText();
- confidence = page.GetMeanConfidence();
- return text;
- }
- }
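The steps above can be put together roughly as follows. This is only a sketch under a few assumptions: the image path "sample.png" and the class name OcrDemo are made up for illustration, and the image is loaded with Pix.LoadFromFile instead of a Bitmap, because as far as I know newer versions of the Tesseract NuGet package no longer accept a Bitmap directly in Process. If your package version still has the Bitmap overload, the GetText method above works unchanged.

```csharp
using System;
using System.IO;
using Tesseract;

public static class OcrDemo
{
    public static void Main(string[] args)
    {
        // Hypothetical input: first command-line argument, or sample.png in the working directory
        string imagePath = args.Length > 0 ? args[0] : "sample.png";

        // Assumes tessdata was copied next to the executable, as described above
        string dataPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "tessdata");

        // Engine, image and page all hold native resources, so dispose each one
        using (var engine = new TesseractEngine(dataPath, "eng", EngineMode.Default))
        using (var image = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(image))
        {
            string text = page.GetText();
            float confidence = page.GetMeanConfidence();

            Console.WriteLine("Mean confidence: {0:P0}", confidence);
            Console.WriteLine(text);
        }
    }
}
```

Only one page can be open per engine at a time, so the page has to be disposed before Process is called again on the same engine.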
Extracting text from a PDF
Namespaces:
- using iTextSharp.text.pdf;
- using iTextSharp.text.pdf.parser;
Required NuGet package: iTextSharp
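- public string ReadPdfContent(string filePath)
- {
- PdfReader pdfReader = new PdfReader(filePath);
- string text = string.Empty;
- try
- {
- // iTextSharp page numbers start at 1
- for (int i = 1; i <= pdfReader.NumberOfPages; i++)
- {
- // Use a fresh strategy per page so text does not carry over between pages
- ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
- // Separate pages with a newline so words at page boundaries do not run together
- text += PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy) + "\n";
- }
- }
- finally
- {
- // Release the file even if extraction throws
- pdfReader.Close();
- }
- return text;
- }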