OCR: Recognize text in an image
The OcrEngine.Recognize method must be called to recognize text in an image. The method takes an image with text as input and returns the recognition results. The method performs the recognition in the working thread.
Here is C#/VB.NET code that shows how to recognize text in an image:
string imageFilePath = @"D:\TestImage.png";
// create the OCR engine
using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr =
    new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
{
    // specify that OCR engine will recognize German text
    Vintasoft.Imaging.Ocr.OcrLanguage language = Vintasoft.Imaging.Ocr.OcrLanguage.German;
    // create the OCR engine settings
    Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings settings =
        new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language);
    // initialize the OCR engine
    tesseractOcr.Init(settings);
    // load an image with text
    using (Vintasoft.Imaging.VintasoftImage image = new Vintasoft.Imaging.VintasoftImage(imageFilePath))
    {
        // specify the image, where text must be recognized
        tesseractOcr.SetImage(image);
        // recognize text in the image
        Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = tesseractOcr.Recognize();
        // get the recognized text
        string ocrResultAsText = ocrResult.GetText();
        // build the path of the text file next to the source image
        string textFilePath = System.IO.Path.Combine(
            System.IO.Path.GetDirectoryName(imageFilePath),
            System.IO.Path.GetFileNameWithoutExtension(imageFilePath) + ".txt");
        // save the recognition results
        System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8);
        // clear the image
        tesseractOcr.ClearImage();
    }
    // shut down the OCR engine
    tesseractOcr.Shutdown();
}
Dim imageFilePath As String = "D:\TestImage.png"
' create the OCR engine
Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
    ' specify that OCR engine will recognize German text
    Dim language As Vintasoft.Imaging.Ocr.OcrLanguage = Vintasoft.Imaging.Ocr.OcrLanguage.German
    ' create the OCR engine settings
    Dim settings As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language)
    ' initialize the OCR engine
    tesseractOcr.Init(settings)
    ' load an image with text
    Using image As New Vintasoft.Imaging.VintasoftImage(imageFilePath)
        ' specify the image, where text must be recognized
        tesseractOcr.SetImage(image)
        ' recognize text in the image
        Dim ocrResult As Vintasoft.Imaging.Ocr.Results.OcrPage = tesseractOcr.Recognize()
        ' get the recognized text
        Dim ocrResultAsText As String = ocrResult.GetText()
        ' build the path of the text file next to the source image
        Dim textFilePath As String = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(imageFilePath), System.IO.Path.GetFileNameWithoutExtension(imageFilePath) & ".txt")
        ' save the recognition results
        System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8)
        ' clear the image
        tesseractOcr.ClearImage()
    End Using
    ' shut down the OCR engine
    tesseractOcr.Shutdown()
End Using
The OCR engine binarizes the image before text recognition if the OcrEngine.Binarization property defines an image binarization command. Binarization is disabled if the OcrEngine.Binarization property is set to null. In most cases, binarizing a document image before text recognition improves the recognition quality. For mixed-content images, it is sometimes better not to binarize the image before running OCR.
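As an illustration of the paragraph above, here is a minimal C# sketch that disables binarization before recognizing a mixed-content image. It reuses only the calls shown in the first example plus the OcrEngine.Binarization property. The class and method names are hypothetical, English is chosen only for illustration, and setting the property after Init and before SetImage is an assumption, not a documented requirement.

```csharp
using Vintasoft.Imaging;
using Vintasoft.Imaging.Ocr;
using Vintasoft.Imaging.Ocr.Results;
using Vintasoft.Imaging.Ocr.Tesseract;

// Hypothetical helper class; requires references to the VintaSoft imaging and OCR assemblies.
public static class OcrWithoutBinarizationSketch
{
    // Recognizes English text in the specified image without binarizing the image first.
    public static string RecognizeWithoutBinarization(string imageFilePath)
    {
        using (TesseractOcr tesseractOcr = new TesseractOcr())
        {
            // initialize the OCR engine for English text
            tesseractOcr.Init(new TesseractOcrSettings(OcrLanguage.English));
            // disable binarization (assumption: the property may be changed at this point)
            tesseractOcr.Binarization = null;

            string recognizedText;
            using (VintasoftImage image = new VintasoftImage(imageFilePath))
            {
                // specify the image, where text must be recognized
                tesseractOcr.SetImage(image);
                // recognize text in the image
                OcrPage ocrResult = tesseractOcr.Recognize();
                recognizedText = ocrResult.GetText();
                // clear the image
                tesseractOcr.ClearImage();
            }
            // shut down the OCR engine
            tesseractOcr.Shutdown();
            return recognizedText;
        }
    }
}
```

Comparing the output of this method with the output of the default configuration on the same image is a simple way to check whether binarization helps for a given kind of document.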
The OcrEngine.Recognize method performs text recognition in the working thread. In practice, it might be necessary to perform text recognition in several threads to speed up the recognition process. Likewise, the OcrEngine.Recognize method performs text recognition in a single language, while in practice it might be necessary to recognize text in several languages at once.
The OcrEngineManager class is implemented to solve these problems.
The OcrEngineManager class allows you to:
- Divide the image with text into several regions with text and recognize text in each region.
- Recognize text in several threads using several instances of the OCR engine.
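The multi-language case mentioned above could be sketched as follows, assuming that each RecognitionRegion carries its own language, as the RecognitionRegion constructor used elsewhere in this topic suggests. The class name, method name, and region coordinates are hypothetical, and the sketch assumes that the Tesseract language data for both English and German is available to the engines.

```csharp
using Vintasoft.Imaging;
using Vintasoft.Imaging.Ocr;
using Vintasoft.Imaging.Ocr.Results;
using Vintasoft.Imaging.Ocr.Tesseract;

// Hypothetical helper class; requires references to the VintaSoft imaging and OCR assemblies.
public static class MultiLanguageRegionsSketch
{
    // Recognizes an English region and a German region of the same image.
    public static string RecognizeTwoLanguages(string imageFilePath)
    {
        using (TesseractOcr tesseractOcr = new TesseractOcr())
        using (TesseractOcr additionalEngine = new TesseractOcr())
        {
            // create the OCR engine manager that uses two OCR engines
            OcrEngineManager engineManager = new OcrEngineManager(tesseractOcr, additionalEngine);
            using (VintasoftImage image = new VintasoftImage(imageFilePath))
            {
                OcrEngineSettings settings = new OcrEngineSettings();
                // illustrative coordinates: adjust them to the layout of the real image
                RecognitionRegion[] regions = new RecognitionRegion[] {
                    new RecognitionRegion(new RegionOfInterest(0, 0, 319, 80), OcrLanguage.English),
                    new RecognitionRegion(new RegionOfInterest(0, 330, 319, 85), OcrLanguage.German) };
                // recognize text in both regions and return the combined text
                OcrPage ocrResult = engineManager.Recognize(image, settings, regions);
                return ocrResult.GetText();
            }
        }
    }
}
```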
Text recognition in several threads
By default, text recognition is performed in a single thread using only one instance of the OCR engine. It is possible to recognize text in several threads using several instances of the OCR engine.
Here is C#/VB.NET code that shows how to recognize text in several regions of the same image in two threads:
string imageFilePath = @"D:\TestImage.png";
// create the main OCR engine
using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
{
    // create the additional OCR engine
    using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr additionalEngine = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
    {
        // create the OCR engine manager
        Vintasoft.Imaging.Ocr.OcrEngineManager engineManager =
            new Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngine);
        // load an image with text
        using (Vintasoft.Imaging.VintasoftImage image = new Vintasoft.Imaging.VintasoftImage(imageFilePath))
        {
            // create the OCR engine settings
            Vintasoft.Imaging.Ocr.OcrEngineSettings settings = new Vintasoft.Imaging.Ocr.OcrEngineSettings();
            // create the regions, where text must be searched
            Vintasoft.Imaging.Ocr.RecognitionRegion[] regions = new Vintasoft.Imaging.Ocr.RecognitionRegion[] {
                new Vintasoft.Imaging.Ocr.RecognitionRegion(
                    new Vintasoft.Imaging.RegionOfInterest(0, 0, 319, 80), Vintasoft.Imaging.Ocr.OcrLanguage.English),
                new Vintasoft.Imaging.Ocr.RecognitionRegion(
                    new Vintasoft.Imaging.RegionOfInterest(0, 330, 319, 85), Vintasoft.Imaging.Ocr.OcrLanguage.English, 180) };
            // recognize text in the image regions
            Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = engineManager.Recognize(image, settings, regions);
            // get the recognized text
            string ocrResultAsText = ocrResult.GetText();
            // build the path of the text file next to the source image
            string textFilePath = System.IO.Path.Combine(
                System.IO.Path.GetDirectoryName(imageFilePath),
                System.IO.Path.GetFileNameWithoutExtension(imageFilePath) + ".txt");
            // save the recognition results
            System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8);
        }
    }
}
Dim imageFilePath As String = "D:\TestImage.png"
' create the main OCR engine
Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
    ' create the additional OCR engine
    Using additionalEngine As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
        ' create the OCR engine manager
        Dim engineManager As New Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngine)
        ' load an image with text
        Using image As New Vintasoft.Imaging.VintasoftImage(imageFilePath)
            ' create the OCR engine settings
            Dim settings As New Vintasoft.Imaging.Ocr.OcrEngineSettings()
            ' create the regions, where text must be searched
            Dim regions As Vintasoft.Imaging.Ocr.RecognitionRegion() = New Vintasoft.Imaging.Ocr.RecognitionRegion() {New Vintasoft.Imaging.Ocr.RecognitionRegion(New Vintasoft.Imaging.RegionOfInterest(0, 0, 319, 80), Vintasoft.Imaging.Ocr.OcrLanguage.English), New Vintasoft.Imaging.Ocr.RecognitionRegion(New Vintasoft.Imaging.RegionOfInterest(0, 330, 319, 85), Vintasoft.Imaging.Ocr.OcrLanguage.English, 180)}
            ' recognize text in the image regions
            Dim ocrResult As Vintasoft.Imaging.Ocr.Results.OcrPage = engineManager.Recognize(image, settings, regions)
            ' get the recognized text
            Dim ocrResultAsText As String = ocrResult.GetText()
            ' build the path of the text file next to the source image
            Dim textFilePath As String = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(imageFilePath), System.IO.Path.GetFileNameWithoutExtension(imageFilePath) & ".txt")
            ' save the recognition results
            System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8)
        End Using
    End Using
End Using
Here is C#/VB.NET code that shows how to recognize text in several images in three threads:
string imageFilePath = @"D:\TestImage.png";
// create the OCR engine
using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr =
    new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
{
    // create an array for additional OCR engines
    Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr[] additionalEngines =
        new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr[2];
    try
    {
        // for each additional OCR engine
        for (int i = 0; i < additionalEngines.Length; i++)
            // create the additional OCR engine
            additionalEngines[i] = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr();
        // create the OCR engine manager
        Vintasoft.Imaging.Ocr.OcrEngineManager engineManager =
            new Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngines);
        // create an image collection
        using (Vintasoft.Imaging.ImageCollection images =
            new Vintasoft.Imaging.ImageCollection())
        {
            // load the images from file
            images.Add(imageFilePath);
            // create the OCR engine settings
            Vintasoft.Imaging.Ocr.OcrEngineSettings settings = new Vintasoft.Imaging.Ocr.OcrEngineSettings();
            // recognize text in images
            Vintasoft.Imaging.Ocr.Results.OcrDocument ocrResults = engineManager.Recognize(images, settings);
            // create string with the recognition results
            System.Text.StringBuilder documentContent = new System.Text.StringBuilder();
            // for each recognized page
            for (int i = 0; i < ocrResults.Pages.Count; i++)
            {
                // add the page header and the recognized text of the page
                documentContent.AppendFormat("Page {0}", i + 1);
                documentContent.AppendLine();
                documentContent.AppendLine();
                documentContent.AppendLine(ocrResults.Pages[i].GetText());
                // separate pages with empty lines
                if (i != ocrResults.Pages.Count - 1)
                {
                    documentContent.AppendLine();
                    documentContent.AppendLine();
                    documentContent.AppendLine();
                }
            }
            string textFilePath = System.IO.Path.Combine(
                System.IO.Path.GetDirectoryName(imageFilePath),
                System.IO.Path.GetFileNameWithoutExtension(imageFilePath) + ".txt");
            // save the recognition results
            System.IO.File.WriteAllText(textFilePath, documentContent.ToString(), System.Text.Encoding.UTF8);
            // clear and dispose images
            images.ClearAndDisposeItems();
        }
    }
    finally
    {
        // for each additional OCR engine
        for (int i = 0; i < additionalEngines.Length; i++)
        {
            if (additionalEngines[i] != null)
                // dispose the additional OCR engine
                additionalEngines[i].Dispose();
        }
    }
}
Dim imageFilePath As String = "D:\TestImage.png"
' create the OCR engine
Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
    ' create an array for additional OCR engines
    Dim additionalEngines As Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr() = New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(1) {}
    Try
        ' for each additional OCR engine
        For i As Integer = 0 To additionalEngines.Length - 1
            ' create the additional OCR engine
            additionalEngines(i) = New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
        Next
        ' create the OCR engine manager
        Dim engineManager As New Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngines)
        ' create an image collection
        Using images As New Vintasoft.Imaging.ImageCollection()
            ' load the images from file
            images.Add(imageFilePath)
            ' create the OCR engine settings
            Dim settings As New Vintasoft.Imaging.Ocr.OcrEngineSettings()
            ' recognize text in images
            Dim ocrResults As Vintasoft.Imaging.Ocr.Results.OcrDocument = engineManager.Recognize(images, settings)
            ' create string with the recognition results
            Dim documentContent As New System.Text.StringBuilder()
            ' for each recognized page
            For i As Integer = 0 To ocrResults.Pages.Count - 1
                ' add the page header and the recognized text of the page
                documentContent.AppendFormat("Page {0}", i + 1)
                documentContent.AppendLine()
                documentContent.AppendLine()
                documentContent.AppendLine(ocrResults.Pages(i).GetText())
                ' separate pages with empty lines
                If i <> ocrResults.Pages.Count - 1 Then
                    documentContent.AppendLine()
                    documentContent.AppendLine()
                    documentContent.AppendLine()
                End If
            Next
            Dim textFilePath As String = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(imageFilePath), System.IO.Path.GetFileNameWithoutExtension(imageFilePath) & ".txt")
            ' save the recognition results
            System.IO.File.WriteAllText(textFilePath, documentContent.ToString(), System.Text.Encoding.UTF8)
            ' clear and dispose images
            images.ClearAndDisposeItems()
        End Using
    Finally
        ' for each additional OCR engine
        For i As Integer = 0 To additionalEngines.Length - 1
            If additionalEngines(i) IsNot Nothing Then
                ' dispose the additional OCR engine
                additionalEngines(i).Dispose()
            End If
        Next
    End Try
End Using
Text recognition in a very large image
Tesseract OCR creates a copy of the image data before text recognition. This may lead to very high memory usage, or even an out-of-memory error, if the image is very large.
The OcrEngineManager class can divide a very large image into several smaller images, recognize text in the resulting images, and combine the OCR results into a single result.
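To give a sense of scale, the following self-contained C# sketch estimates the size of the uncompressed pixel data of a 20000x30000, 24-bpp image. The class and method names are hypothetical, and the estimate ignores row padding and any engine-internal overhead, so it is a lower bound rather than the exact amount of memory used.

```csharp
using System;

// Hypothetical helper for a rough estimate of the raw pixel data size.
public static class LargeImageMemoryEstimate
{
    // Returns the size, in bytes, of uncompressed pixel data (row padding is ignored).
    public static long GetRawImageSizeInBytes(int width, int height, int bitsPerPixel)
    {
        // use 64-bit arithmetic to avoid overflow for very large images
        long bitsPerRow = (long)width * bitsPerPixel;
        // round each row up to a whole number of bytes
        long bytesPerRow = (bitsPerRow + 7) / 8;
        return bytesPerRow * height;
    }

    public static void Main()
    {
        long sizeInBytes = GetRawImageSizeInBytes(20000, 30000, 24);
        // prints "1800000000"
        Console.WriteLine(sizeInBytes);
    }
}
```

The result is 1,800,000,000 bytes, about 1.68 GiB, for a single copy of the pixel data, which is why an additional copy made for recognition can exhaust the available memory.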
Here is C#/VB.NET code that shows how to recognize text in a very large image (20000x30000, 24-bpp, 600 dpi):
string imageFilePath = @"D:\TestImage.png";
// create the OCR engine
using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
{
    // create an array for additional OCR engines
    Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr[] additionalEngines = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr[3];
    try
    {
        // for each additional OCR engine
        for (int i = 0; i < additionalEngines.Length; i++)
            // create the additional OCR engine
            additionalEngines[i] = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr();
        // create the OCR engine manager
        Vintasoft.Imaging.Ocr.OcrEngineManager engineManager = new Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngines);
        // load an image from file
        using (Vintasoft.Imaging.VintasoftImage image = new Vintasoft.Imaging.VintasoftImage(imageFilePath))
        {
            // create the recognition region splitting settings
            Vintasoft.Imaging.Ocr.OcrRecognitionRegionSplittingSettings splittingSettings =
                new Vintasoft.Imaging.Ocr.OcrRecognitionRegionSplittingSettings(5000, 5000, 2000);
            // specify that the recognition regions must be split into subregions
            engineManager.RecognitionRegionSplittingSettings = splittingSettings;
            // create the OCR engine settings
            Vintasoft.Imaging.Ocr.OcrEngineSettings settings = new Vintasoft.Imaging.Ocr.OcrEngineSettings();
            // recognize text in image
            Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = engineManager.Recognize(image, settings);
            // get the recognized text
            string ocrResultAsText = ocrResult.GetText();
            string textFilePath = System.IO.Path.Combine(
                System.IO.Path.GetDirectoryName(imageFilePath),
                System.IO.Path.GetFileNameWithoutExtension(imageFilePath) + ".txt");
            // save the recognized text to a file
            System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8);
        }
    }
    finally
    {
        // for each additional OCR engine
        for (int i = 0; i < additionalEngines.Length; i++)
        {
            if (additionalEngines[i] != null)
                // dispose the additional OCR engine
                additionalEngines[i].Dispose();
        }
    }
}
Dim imageFilePath As String = "D:\TestImage.png"
' create the OCR engine
Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
    ' create an array for additional OCR engines
    Dim additionalEngines As Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr() = New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(2) {}
    Try
        ' for each additional OCR engine
        For i As Integer = 0 To additionalEngines.Length - 1
            ' create the additional OCR engine
            additionalEngines(i) = New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
        Next
        ' create the OCR engine manager
        Dim engineManager As New Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngines)
        ' load an image from file
        Using image As New Vintasoft.Imaging.VintasoftImage(imageFilePath)
            ' create the recognition region splitting settings
            Dim splittingSettings As New Vintasoft.Imaging.Ocr.OcrRecognitionRegionSplittingSettings(5000, 5000, 2000)
            ' specify that the recognition regions must be split into subregions
            engineManager.RecognitionRegionSplittingSettings = splittingSettings
            ' create the OCR engine settings
            Dim settings As New Vintasoft.Imaging.Ocr.OcrEngineSettings()
            ' recognize text in image
            Dim ocrResult As Vintasoft.Imaging.Ocr.Results.OcrPage = engineManager.Recognize(image, settings)
            ' get the recognized text
            Dim ocrResultAsText As String = ocrResult.GetText()
            Dim textFilePath As String = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(imageFilePath), System.IO.Path.GetFileNameWithoutExtension(imageFilePath) & ".txt")
            ' save the recognized text to a file
            System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8)
        End Using
    Finally
        ' for each additional OCR engine
        For i As Integer = 0 To additionalEngines.Length - 1
            If additionalEngines(i) IsNot Nothing Then
                ' dispose the additional OCR engine
                additionalEngines(i).Dispose()
            End If
        Next
    End Try
End Using