OCR: Prepare OCR engine for text recognition
Create an instance of TesseractOcr class
Text recognition with Tesseract OCR requires the following files:
- The Tesseract5.Vintasoft.x86.dll file, if Tesseract OCR is used in a 32-bit Windows application. IMPORTANT: Tesseract5.Vintasoft.x86.dll works correctly only if the Microsoft Visual C++ 2019 Redistributable Package (32-bit version) is installed on the Windows computer.
- The Tesseract5.Vintasoft.x64.dll file, if Tesseract OCR is used in a 64-bit Windows application. IMPORTANT: Tesseract5.Vintasoft.x64.dll works correctly only if the Microsoft Visual C++ 2019 Redistributable Package (64-bit version) is installed on the Windows computer.
- The libTesseract5.Vintasoft.x64.so file, if Tesseract OCR is used in a 64-bit Linux application.
- The "tessdata" directory with language files.
By default, the Tesseract OCR files are located in the "TesseractOcr" directory, which has the following structure:
- TesseractOcr
  - Tesseract5.Vintasoft.x86.dll
  - Tesseract5.Vintasoft.x64.dll
  - tessdata
    - eng.traineddata
    - deu.traineddata
    - ...
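Before creating the OCR engine, it can be useful to confirm that the files listed above are actually present. The following is a minimal sketch that uses only standard .NET file APIs and the file names given in this topic. The TesseractOcrFiles class and its methods are hypothetical helpers and are not part of the SDK; the sketch also assumes that any non-Linux process is a Windows process.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of the SDK.
public static class TesseractOcrFiles
{
    // Returns the native library file name expected for the current process,
    // following the file list in this topic (assumption: non-Linux means Windows).
    public static string GetNativeLibraryName()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Linux))
            return "libTesseract5.Vintasoft.x64.so";
        return Environment.Is64BitProcess
            ? "Tesseract5.Vintasoft.x64.dll"
            : "Tesseract5.Vintasoft.x86.dll";
    }

    // Returns the relative paths of the required files that are missing
    // from the specified Tesseract OCR directory.
    public static List<string> FindMissingFiles(
        string tesseractOcrDirectory, params string[] languageCodes)
    {
        var missing = new List<string>();

        // check the native library for the current platform
        string nativeLibrary = GetNativeLibraryName();
        if (!File.Exists(Path.Combine(tesseractOcrDirectory, nativeLibrary)))
            missing.Add(nativeLibrary);

        // check the language file for each requested language code
        foreach (string code in languageCodes)
        {
            string relativePath = Path.Combine("tessdata", code + ".traineddata");
            if (!File.Exists(Path.Combine(tesseractOcrDirectory, relativePath)))
                missing.Add(relativePath);
        }
        return missing;
    }
}
```

For example, TesseractOcrFiles.FindMissingFiles("TesseractOcr", "eng", "deu") returns an empty list when the native library and both language files are in place.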
An instance of the TesseractOcr class must be created to recognize text with Tesseract OCR. The default (parameterless) constructor uses "<current_directory>\TesseractOcr\" as the directory that contains the Tesseract OCR files. To specify another directory, use the constructor that takes a string parameter.
Here is C#/VB.NET code that shows how to create an instance of the TesseractOcr class and specify the directory that contains the Tesseract OCR files:
string tesseractOcrDllPath = @"D:\TesseractOcr";
Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr =
    new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(tesseractOcrDllPath);
// ...
tesseractOcr.Dispose();

Dim tesseractOcrDllPath As String = "D:\TesseractOcr"
Dim tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(tesseractOcrDllPath)
' ...
tesseractOcr.Dispose()
Initialize an instance of TesseractOcr class
After an instance of the TesseractOcr class is created, the TesseractOcr.Init method must be called to initialize the instance. The TesseractOcr.Init method also specifies the default language for text recognition.
Tesseract OCR 5 can recognize text in more than 100 languages. By default, the SDK installation package includes only the English language dictionary.
The following table shows a complete list of languages supported by Tesseract OCR 5 and links for downloading the language files:
Here is C#/VB.NET code that shows how to specify German as the main language for text recognition:
// create the OCR engine
using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr =
    new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
{
    // specify that OCR engine will recognize German text
    Vintasoft.Imaging.Ocr.OcrLanguage language = Vintasoft.Imaging.Ocr.OcrLanguage.German;
    // create the OCR engine settings
    Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings settings =
        new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language);
    // initialize the OCR engine
    tesseractOcr.Init(settings);
    // ...
    // shutdown the OCR engine
    tesseractOcr.Shutdown();
}

' create the OCR engine
Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
    ' specify that OCR engine will recognize German text
    Dim language As Vintasoft.Imaging.Ocr.OcrLanguage = Vintasoft.Imaging.Ocr.OcrLanguage.German
    ' create the OCR engine settings
    Dim settings As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language)
    ' initialize the OCR engine
    tesseractOcr.Init(settings)
    ' ...
    ' shutdown the OCR engine
    tesseractOcr.Shutdown()
End Using
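Because the installation package includes only the English dictionary by default, recognizing German text presumably requires the German language file (deu.traineddata in the directory structure shown earlier) to be present in "tessdata". The following minimal sketch lists the language files found in a "tessdata" directory using only standard .NET file APIs. The TessdataLanguages class is a hypothetical helper and is not part of the SDK.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the SDK.
public static class TessdataLanguages
{
    // Returns the sorted language codes for which a "<code>.traineddata" file
    // exists in the "tessdata" subdirectory of the specified directory.
    public static List<string> GetInstalledLanguageCodes(string tesseractOcrDirectory)
    {
        var codes = new List<string>();
        string tessdata = Path.Combine(tesseractOcrDirectory, "tessdata");

        // no "tessdata" directory means no language files
        if (!Directory.Exists(tessdata))
            return codes;

        // the language code is the file name without the ".traineddata" extension
        foreach (string file in Directory.GetFiles(tessdata, "*.traineddata"))
            codes.Add(Path.GetFileNameWithoutExtension(file));

        codes.Sort(StringComparer.Ordinal);
        return codes;
    }
}
```

A caller could check that the returned list contains "deu" before initializing the engine with OcrLanguage.German, and report a clear error otherwise.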
Parameters of Tesseract OCR
Tesseract OCR 5 exposes many configuration parameters.
The value of any parameter can be read using the TesseractOcr.GetVariable method.
The value of any parameter can be set using the TesseractOcr.SetVariable method.
Here is C#/VB.NET code that shows how to use the "tessedit_char_whitelist" parameter to restrict recognition to digits:
/// <summary>
/// Specifies that text contains only the limited set of characters and
/// recognizes the text in image.
/// </summary>
/// <param name="filename">The name of file, which stores images with text.</param>
public static void OcrImageWithDigits(string filename)
{
    // create an image collection
    using (Vintasoft.Imaging.ImageCollection images =
        new Vintasoft.Imaging.ImageCollection())
    {
        // add images from file to the image collection
        images.Add(filename);

        System.Console.WriteLine("Create Tesseract OCR engine...");
        // create the Tesseract OCR engine
        using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr =
            new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
        {
            System.Console.WriteLine("Initialize OCR engine...");
            // init the Tesseract OCR engine
            tesseractOcr.Init(new Vintasoft.Imaging.Ocr.OcrEngineSettings(
                Vintasoft.Imaging.Ocr.OcrLanguage.English));

            // set the "white list" of recognizing characters
            tesseractOcr.SetVariable(
                "tessedit_char_whitelist", "01234567890");

            // for each image
            foreach (Vintasoft.Imaging.VintasoftImage image in images)
            {
                System.Console.WriteLine("Recognize the image...");
                // recognize text in image
                Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = tesseractOcr.Recognize(image);

                // output the recognized text
                System.Console.WriteLine("Page Text:");
                System.Console.WriteLine(ocrResult.GetText());
                System.Console.WriteLine();
            }

            // shutdown the Tesseract OCR engine
            tesseractOcr.Shutdown();
        }

        // free images
        images.ClearAndDisposeItems();
    }
}

''' <summary>
''' Specifies that text contains only the limited set of characters and
''' recognizes the text in image.
''' </summary>
''' <param name="filename">The name of file, which stores images with text.</param>
Public Shared Sub OcrImageWithDigits(filename As String)
    ' create an image collection
    Using images As New Vintasoft.Imaging.ImageCollection()
        ' add images from file to the image collection
        images.Add(filename)

        System.Console.WriteLine("Create Tesseract OCR engine...")
        ' create the Tesseract OCR engine
        Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
            System.Console.WriteLine("Initialize OCR engine...")
            ' init the Tesseract OCR engine
            tesseractOcr.Init(New Vintasoft.Imaging.Ocr.OcrEngineSettings(Vintasoft.Imaging.Ocr.OcrLanguage.English))

            ' set the "white list" of recognizing characters
            tesseractOcr.SetVariable("tessedit_char_whitelist", "01234567890")

            ' for each image
            For Each image As Vintasoft.Imaging.VintasoftImage In images
                System.Console.WriteLine("Recognize the image...")
                ' recognize text in image
                Dim ocrResult As Vintasoft.Imaging.Ocr.Results.OcrPage = tesseractOcr.Recognize(image)

                ' output the recognized text
                System.Console.WriteLine("Page Text:")
                System.Console.WriteLine(ocrResult.GetText())
                System.Console.WriteLine()
            Next

            ' shutdown the Tesseract OCR engine
            tesseractOcr.Shutdown()
        End Using

        ' free images
        images.ClearAndDisposeItems()
    End Using
End Sub
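The whitelist value passed to SetVariable in the example above is a plain string of permitted characters. When the set of characters is assembled from several groups (digits, punctuation and so on), a small helper can build the string and drop repeated characters. The following is a minimal sketch under that assumption; the CharWhitelist class is hypothetical and is not part of the SDK.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the SDK.
public static class CharWhitelist
{
    // Concatenates the character groups into one whitelist string,
    // keeping the first occurrence of each character and preserving order.
    public static string Build(params string[] characterGroups)
    {
        var seen = new HashSet<char>();
        var result = new StringBuilder();
        foreach (string group in characterGroups)
        {
            foreach (char c in group)
            {
                // HashSet.Add returns false if the character was already added
                if (seen.Add(c))
                    result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

The result of CharWhitelist.Build("0123456789", ".,-") could then be passed as the value of the "tessedit_char_whitelist" parameter in the same way as the literal string in the example above.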