VintaSoft Imaging .NET SDK v8.7
    OCR: Prepare OCR engine for text recognition

    Create an instance of TesseractOcr class

    Text recognition with Tesseract OCR requires the following files:

    IMPORTANT: The Microsoft Visual C++ 2010 Redistributable Package (32-bit version) must be installed on the computer for Tesseract3.Vintasoft.x86.dll to work correctly. The Microsoft Visual C++ 2010 Redistributable Package (64-bit version) must be installed on the computer for Tesseract3.Vintasoft.x64.dll to work correctly.
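
    Which of the two redistributable packages matters on a given machine depends on whether the application runs as a 32-bit or a 64-bit process. The sketch below uses only standard .NET APIs to report which of the two native libraries named above the current process would need. The class NativeDllCheck and the method GetExpectedNativeDllName are hypothetical names introduced for illustration and are not part of the SDK; the sketch also assumes that the native library is selected according to the bitness of the process.

```csharp
using System;

static class NativeDllCheck
{
    // Hypothetical helper (not part of the SDK): maps the process bitness
    // to the native Tesseract library name mentioned in this topic.
    // Assumption: the x64 library is used by 64-bit processes and
    // the x86 library is used by 32-bit processes.
    public static string GetExpectedNativeDllName(bool is64BitProcess)
    {
        return is64BitProcess
            ? "Tesseract3.Vintasoft.x64.dll"
            : "Tesseract3.Vintasoft.x86.dll";
    }

    static void Main()
    {
        // Environment.Is64BitProcess is available starting with .NET Framework 4.0
        string dllName = GetExpectedNativeDllName(Environment.Is64BitProcess);
        Console.WriteLine("Expected native library: " + dllName);
    }
}
```

    An application built for "Any CPU" may run as either kind of process, so it can be safer to install both versions of the redistributable package in that case.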


    By default, the Tesseract OCR files are located in the "TesseractOcr" directory, which has the following structure:

    An instance of the TesseractOcr class must be created before text can be recognized with Tesseract OCR. The default constructor (the constructor without parameters) uses the "\TesseractOcr\" directory as the location of the Tesseract OCR files. To specify another directory, use the constructor that takes a string parameter.

    Here is an example that shows how to create an instance of the TesseractOcr class and specify the directory where the Tesseract OCR files are located:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging.Ocr.Tesseract
    
    Dim tesseractOcrDllPath As String = "D:\Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr"
    Dim tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(tesseractOcrDllPath)
    
    ' ...
    
    tesseractOcr.Dispose()
                  
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging.Ocr.Tesseract
    
    string tesseractOcrDllPath = @"D:\Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr";
    Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = 
        new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(tesseractOcrDllPath);
    
    // ...
    
    tesseractOcr.Dispose();
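
    When the directory is not hard-coded, it may help to resolve and validate it before passing it to the constructor. The following is a minimal sketch that uses only standard .NET APIs; the class TesseractDirectoryLocator and its Resolve method are hypothetical names and are not part of the SDK. It assumes that the Tesseract OCR files were deployed to a "TesseractOcr" folder next to the application.

```csharp
using System;
using System.IO;

static class TesseractDirectoryLocator
{
    // Hypothetical helper (not part of the SDK): returns the full path of
    // the folder 'folderName' inside 'baseDirectory', or throws
    // DirectoryNotFoundException if that folder does not exist.
    public static string Resolve(string baseDirectory, string folderName)
    {
        string path = Path.GetFullPath(Path.Combine(baseDirectory, folderName));
        if (!Directory.Exists(path))
            throw new DirectoryNotFoundException(
                "Tesseract OCR directory not found: " + path);
        return path;
    }

    static void Main()
    {
        try
        {
            // Assumption: the files are deployed next to the application
            string directory = Resolve(
                AppDomain.CurrentDomain.BaseDirectory, "TesseractOcr");
            Console.WriteLine("Tesseract OCR directory: " + directory);
            // 'directory' can now be passed to the TesseractOcr constructor
            // that takes a string parameter, as in the example above
        }
        catch (DirectoryNotFoundException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```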
                    
    


    Initialize an instance of TesseractOcr class

    After an instance of the TesseractOcr class is created, the TesseractOcr.Init method must be called to initialize it. The TesseractOcr.Init method lets you specify the default language for text recognition.

    Tesseract OCR 3.04 can recognize text in more than 100 languages. By default, the SDK installation package includes only the English language dictionary.

    The following table shows a complete list of languages supported by Tesseract OCR 3.04 and links for downloading the language files:

    Here is an example that shows how to specify German as the main language for text recognition:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging.Ocr
    ' - Vintasoft.Imaging.Ocr.Tesseract
    
    ' create the OCR engine
    Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
            ' specify that OCR engine will recognize German text
            Dim language As Vintasoft.Imaging.Ocr.OcrLanguage = Vintasoft.Imaging.Ocr.OcrLanguage.German
            ' create the OCR engine settings
            Dim settings As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language)
            ' initialize the OCR engine
            tesseractOcr.Init(settings)
    
            ' ...
    
            ' shutdown the OCR engine
            tesseractOcr.Shutdown()
    End Using
                  
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging.Ocr
    // - Vintasoft.Imaging.Ocr.Tesseract
    
    // create the OCR engine
    using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = 
        new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
    {
        // specify that OCR engine will recognize German text
        Vintasoft.Imaging.Ocr.OcrLanguage language = Vintasoft.Imaging.Ocr.OcrLanguage.German;
        // create the OCR engine settings
        Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings settings = 
            new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language);
        // initialize the OCR engine
        tesseractOcr.Init(settings);
    
        // ...
    
        // shutdown the OCR engine
        tesseractOcr.Shutdown();
    }
                    
    


    Parameters of Tesseract OCR

    Tesseract OCR 3.04 has many parameters. A list of useful parameters is available here: http://github.com/tesseract-ocr/tesseract/wiki/ControlParams. The full list of parameters for version 3.02 is available here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version.

    You can get the value of any parameter using the TesseractOcr.GetVariable method.
    You can set the value of any parameter using the TesseractOcr.SetVariable method.

    Here is an example that shows how to use the "tessedit_char_whitelist" parameter to restrict recognition to a limited set of characters:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging
    ' - Vintasoft.Imaging.Ocr
    ' - Vintasoft.Imaging.Ocr.Tesseract
    
    Class TesseractOcrSetVariableExample
            ''' <summary>
            ''' Specifies that text contains only the limited set of characters and
            ''' recognizes the text in image.
            ''' </summary>
            ''' <param name="filename">The name of file, which stores images with text.</param>
            Public Shared Sub OcrImageWithDigits(filename As String)
                    ' create an image collection
                    Using images As New Vintasoft.Imaging.ImageCollection()
                            ' add images from file to the image collection
                            images.Add(filename)
    
                            System.Console.WriteLine("Create Tesseract OCR engine...")
                            ' create the Tesseract OCR engine
                            Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
                                    System.Console.WriteLine("Initialize OCR engine...")
                                    ' init the Tesseract OCR engine
                                    tesseractOcr.Init(New Vintasoft.Imaging.Ocr.OcrEngineSettings(Vintasoft.Imaging.Ocr.OcrLanguage.English))
    
                                    ' set the "white list" of characters to recognize (digits only)
                                    tesseractOcr.SetVariable("tessedit_char_whitelist", "0123456789")
    
                                    ' for each image
                                    For Each image As Vintasoft.Imaging.VintasoftImage In images
                                            System.Console.WriteLine("Recognize the image...")
    
                                            ' recognize text in image
                                            Dim ocrResult As Vintasoft.Imaging.Ocr.Results.OcrPage = tesseractOcr.Recognize(image)
    
                                            ' output the recognized text
    
                                            System.Console.WriteLine("Page Text:")
                                            System.Console.WriteLine(ocrResult.GetText())
                                            System.Console.WriteLine()
                                    Next
    
                                    ' shutdown the Tesseract OCR engine
                                    tesseractOcr.Shutdown()
                            End Using
    
                            ' free images
                            images.ClearAndDisposeItems()
                    End Using
            End Sub
    End Class
                  
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging
    // - Vintasoft.Imaging.Ocr
    // - Vintasoft.Imaging.Ocr.Tesseract
    
    class TesseractOcrSetVariableExample
    {
        /// <summary>
        /// Specifies that text contains only the limited set of characters and
        /// recognizes the text in image.
        /// </summary>
        /// <param name="filename">The name of file, which stores images with text.</param>
        public static void OcrImageWithDigits(string filename)
        {
            // create an image collection
            using (Vintasoft.Imaging.ImageCollection images = 
                new Vintasoft.Imaging.ImageCollection())
            {
                // add images from file to the image collection
                images.Add(filename);
    
                System.Console.WriteLine("Create Tesseract OCR engine...");
                // create the Tesseract OCR engine
                using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = 
                    new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
                {
                    System.Console.WriteLine("Initialize OCR engine...");
                    // init the Tesseract OCR engine
                    tesseractOcr.Init(new Vintasoft.Imaging.Ocr.OcrEngineSettings(
                        Vintasoft.Imaging.Ocr.OcrLanguage.English));
    
                    // set the "white list" of characters to recognize (digits only)
                    tesseractOcr.SetVariable(
                        "tessedit_char_whitelist", "0123456789");
    
                    // for each image
                    foreach (Vintasoft.Imaging.VintasoftImage image in images)
                    {
                        System.Console.WriteLine("Recognize the image...");
    
                        // recognize text in image
                        Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = tesseractOcr.Recognize(image);
    
                        // output the recognized text
    
                        System.Console.WriteLine("Page Text:");
                        System.Console.WriteLine(ocrResult.GetText());
                        System.Console.WriteLine();
                    }
    
                    // shutdown the Tesseract OCR engine
                    tesseractOcr.Shutdown();
                }
    
                // free images
                images.ClearAndDisposeItems();
            }
        }
    }
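
    The TesseractOcr.GetVariable method can be used in the same way to read a parameter back, for instance to confirm that a value set with TesseractOcr.SetVariable is in effect. The sketch below is illustrative only: it assumes that GetVariable takes the parameter name as its single argument and returns the current value, which should be checked against the API reference. The class TesseractOcrGetVariableExample and the method PrintCharWhitelist are hypothetical names.

```csharp
// The project, which uses this code, must have references to the following assemblies:
// - Vintasoft.Imaging.Ocr
// - Vintasoft.Imaging.Ocr.Tesseract

class TesseractOcrGetVariableExample
{
    /// <summary>
    /// Sets the "tessedit_char_whitelist" parameter and reads it back.
    /// </summary>
    public static void PrintCharWhitelist()
    {
        // create the Tesseract OCR engine
        using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr =
            new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
        {
            // init the Tesseract OCR engine
            tesseractOcr.Init(new Vintasoft.Imaging.Ocr.OcrEngineSettings(
                Vintasoft.Imaging.Ocr.OcrLanguage.English));

            // restrict recognition to digits
            tesseractOcr.SetVariable("tessedit_char_whitelist", "0123456789");

            // ASSUMPTION: GetVariable(name) returns the current value of the parameter
            System.Console.WriteLine("tessedit_char_whitelist = " +
                tesseractOcr.GetVariable("tessedit_char_whitelist"));

            // shutdown the Tesseract OCR engine
            tesseractOcr.Shutdown();
        }
    }
}
```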