VintaSoft Imaging .NET SDK 14.0: Documentation for .NET developer
    OCR: Prepare OCR engine for text recognition

    Create an instance of TesseractOcr class

    The following files are required for text recognition using Tesseract OCR:
    By default, the Tesseract OCR files are located in the "TesseractOcr" directory, which has the following structure:
    To recognize text with Tesseract OCR, create an instance of the TesseractOcr class. The default (parameterless) constructor uses "<current_directory>\TesseractOcr\" as the directory that contains the Tesseract OCR files. To use a different directory, call the constructor that takes a string parameter.

    Here is C#/VB.NET code that shows how to create an instance of the TesseractOcr class and specify the directory where the Tesseract OCR files are located:
    string tesseractOcrDllPath = @"D:\TesseractOcr";
    Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = 
        new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(tesseractOcrDllPath);
    
    // ...
    
    tesseractOcr.Dispose();
    
    Dim tesseractOcrDllPath As String = "D:\TesseractOcr"
    Dim tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(tesseractOcrDllPath)
    
    ' ...
    
    tesseractOcr.Dispose()
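
The directory rule described above (default "<current_directory>\TesseractOcr", or a custom directory passed as a string) can be checked before the engine is created. The following is a minimal sketch, not part of the SDK: the class name TesseractOcrPaths and the method ResolveDirectory are hypothetical helpers, and the sketch only verifies that the directory exists, not that it contains the required files.

```csharp
using System;
using System.IO;

// Hypothetical helper (not an SDK class): resolves the directory that is
// expected to contain the Tesseract OCR files.
public static class TesseractOcrPaths
{
    // Returns the full path of the Tesseract OCR directory.
    // If customDirectory is null or empty, the default location
    // "<current_directory>\TesseractOcr" is used.
    // Throws DirectoryNotFoundException if the directory does not exist.
    public static string ResolveDirectory(string customDirectory)
    {
        string directory = string.IsNullOrEmpty(customDirectory)
            ? Path.Combine(Directory.GetCurrentDirectory(), "TesseractOcr")
            : customDirectory;

        string fullPath = Path.GetFullPath(directory);
        if (!Directory.Exists(fullPath))
        {
            throw new DirectoryNotFoundException(
                "Tesseract OCR directory was not found: " + fullPath);
        }
        return fullPath;
    }
}
```

The returned path can then be passed to the TesseractOcr constructor that takes a string parameter, so a missing directory is reported before the OCR engine is created.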
    


    Initialize an instance of TesseractOcr class

    After an instance of the TesseractOcr class is created, call the TesseractOcr.Init method to initialize it. The TesseractOcr.Init method lets you specify the default language for text recognition.

    Tesseract OCR 5 can recognize text in more than 100 languages. By default, the SDK installation package includes only the English language dictionary.

    The following table shows a complete list of languages supported by Tesseract OCR 5 and links for downloading the language files:
    Language Tesseract 5.0 (fast) dictionary Tesseract 5.0 (best) dictionary Tesseract 5.0 (standard) dictionary
    Afrikaans Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Amharic Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Arabic Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Assamese Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Azerbaijani Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Azerbaijani Cyrillic Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Belarusian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Bengali Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Tibetan Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Bosnian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Breton (Bre) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Bulgarian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Catalan Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Cebuano Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Czech Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Chinese Simplified Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Chinese Simplified vertical Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Chinese Traditional Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Chinese Traditional vertical Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Cherokee Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Corsican (Cos) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Welsh Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Danish Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Danish Fraktur - - Tesseract 5.0 (standard)
    German Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    German Fraktur - - Tesseract 5.0 (standard)
    Dhivehi (Div) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Dzongkha Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Greek Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    English Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    English Middle Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Esperanto Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Math Equation Detection Module - - Tesseract 5.0 (standard)
    Estonian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Basque Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Faroese (Fao) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Persian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Filipino (Fil) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Finnish Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    French Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Frankish Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    French Middle Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Western Frisian (Fry) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Scottish Gaelic (Gla) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Irish Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Galician Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Greek Ancient Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Gujarati Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Haitian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Hebrew Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Hindi Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Croatian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Hungarian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Armenian (Hye) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Inuktitut Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Indonesian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Icelandic Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Italian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Italian Old Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Javanese Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Japanese Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Japanese vertical Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Kannada Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Georgian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Georgian Old Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Kazakh Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Central Khmer Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Kirghiz Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Kurmanji Kurdish (Kmr) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Korean Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Korean vertical Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Lao Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Latin Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Latvian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Lithuanian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Luxembourgish (Ltz) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Malayalam Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Marathi Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Macedonian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Maltese Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    MICR (magnetic ink character recognition) Tesseract 5.0 (fast) Tesseract 5.0 (best) -
    Mongolian (Mon) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Maori (Mri) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    MRZ (machine-readable zone) Tesseract 5.0 (fast) Tesseract 5.0 (best) -
    Malay Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Burmese Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Nepali Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Dutch Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Norwegian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Occitan (Oci) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Oriya Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Orientation and Script Detection (Osd) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Panjabi Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Polish Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Portuguese Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Pushto Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Quechua (Que) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Romanian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Russian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Sanskrit Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Sinhala Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Slovakian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Slovak Fraktur - - Tesseract 5.0 (standard)
    Slovenian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Sindhi (Snd) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Spanish Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Spanish Castilian Old Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Albanian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Serbian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Serbian Latin Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Sundanese (Sun) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Swahili Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Swedish Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Syriac Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Tamil Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Tatar (Tat) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Telugu Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Tajik Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Tagalog - - Tesseract 5.0 (standard)
    Thai Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Tigrinya Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Tongan (Ton) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Turkish Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Uighur Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Ukrainian Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Urdu Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Uzbek Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Uzbek Cyrillic Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Vietnamese Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Yiddish Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)
    Yoruba (Yor) Tesseract 5.0 (fast) Tesseract 5.0 (best) Tesseract 5.0 (standard)


    Here is C#/VB.NET code that shows how to set German as the main language for text recognition:
    // create the OCR engine
    using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = 
        new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
    {
        // specify that OCR engine will recognize German text
        Vintasoft.Imaging.Ocr.OcrLanguage language = Vintasoft.Imaging.Ocr.OcrLanguage.German;
        // create the OCR engine settings
        Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings settings = 
            new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language);
        // initialize the OCR engine
        tesseractOcr.Init(settings);
    
        // ...
    
        // shutdown the OCR engine
        tesseractOcr.Shutdown();
    }
    
    ' create the OCR engine
    Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
        ' specify that OCR engine will recognize German text
        Dim language As Vintasoft.Imaging.Ocr.OcrLanguage = Vintasoft.Imaging.Ocr.OcrLanguage.German
        ' create the OCR engine settings
        Dim settings As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language)
        ' initialize the OCR engine
        tesseractOcr.Init(settings)
    
        ' ...
    
        ' shutdown the OCR engine
        tesseractOcr.Shutdown()
    End Using
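
Because the SDK installation package includes only the English dictionary by default, it can help to confirm that the dictionary for another language has been downloaded before the engine is initialized with that language. The sketch below makes two assumptions that are not stated in this topic: that the dictionary files keep the standard Tesseract naming "<code>.traineddata" (for example "deu.traineddata" for German), and that they are stored somewhere under the Tesseract OCR directory. It searches recursively so that no particular subdirectory name has to be assumed. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not an SDK class): looks for a Tesseract language
// dictionary file under the Tesseract OCR directory.
public static class TesseractLanguageFiles
{
    // Returns true if a file named "<languageCode>.traineddata" exists
    // anywhere under tesseractOcrDirectory; returns false if the file or
    // the directory itself is missing.
    public static bool IsLanguageInstalled(string tesseractOcrDirectory, string languageCode)
    {
        if (!Directory.Exists(tesseractOcrDirectory))
        {
            return false;
        }

        // assumption: standard Tesseract file naming, e.g. "deu.traineddata"
        string fileName = languageCode + ".traineddata";
        return Directory
            .EnumerateFiles(tesseractOcrDirectory, fileName, SearchOption.AllDirectories)
            .Any();
    }
}
```

For example, calling IsLanguageInstalled with the Tesseract OCR directory and "deu" before initializing the engine for German gives a clear early failure instead of an initialization error.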
    


    Parameters of Tesseract OCR

    Tesseract OCR 5 has many parameters.

    You can get the value of any parameter using the TesseractOcr.GetVariable method.
    You can set the value of any parameter using the TesseractOcr.SetVariable method.

    Here is C#/VB.NET code that shows how to use the "tessedit_char_whitelist" parameter to limit recognition to a given set of characters:
    /// <summary>
    /// Specifies that text contains only the limited set of characters and
    /// recognizes the text in image.
    /// </summary>
    /// <param name="filename">The name of file, which stores images with text.</param>
    public static void OcrImageWithDigits(string filename)
    {
        // create an image collection
        using (Vintasoft.Imaging.ImageCollection images = 
            new Vintasoft.Imaging.ImageCollection())
        {
            // add images from file to the image collection
            images.Add(filename);
    
            System.Console.WriteLine("Create Tesseract OCR engine...");
            // create the Tesseract OCR engine
            using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = 
                new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
            {
                System.Console.WriteLine("Initialize OCR engine...");
                // init the Tesseract OCR engine
                tesseractOcr.Init(new Vintasoft.Imaging.Ocr.OcrEngineSettings(
                    Vintasoft.Imaging.Ocr.OcrLanguage.English));
    
                // set the "white list" of characters to recognize (digits only)
                tesseractOcr.SetVariable(
                    "tessedit_char_whitelist", "0123456789");
    
                // for each image
                foreach (Vintasoft.Imaging.VintasoftImage image in images)
                {
                    System.Console.WriteLine("Recognize the image...");
    
                    // recognize text in image
                    Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = tesseractOcr.Recognize(image);
    
                    // output the recognized text
    
                    System.Console.WriteLine("Page Text:");
                    System.Console.WriteLine(ocrResult.GetText());
                    System.Console.WriteLine();
                }
    
                // shutdown the Tesseract OCR engine
                tesseractOcr.Shutdown();
            }
    
            // free images
            images.ClearAndDisposeItems();
        }
    }
    
    ''' <summary>
    ''' Specifies that text contains only the limited set of characters and
    ''' recognizes the text in image.
    ''' </summary>
    ''' <param name="filename">The name of file, which stores images with text.</param>
    Public Shared Sub OcrImageWithDigits(filename As String)
        ' create an image collection
        Using images As New Vintasoft.Imaging.ImageCollection()
            ' add images from file to the image collection
            images.Add(filename)
    
            System.Console.WriteLine("Create Tesseract OCR engine...")
            ' create the Tesseract OCR engine
            Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
                System.Console.WriteLine("Initialize OCR engine...")
                ' init the Tesseract OCR engine
                tesseractOcr.Init(New Vintasoft.Imaging.Ocr.OcrEngineSettings(Vintasoft.Imaging.Ocr.OcrLanguage.English))
    
                ' set the "white list" of characters to recognize (digits only)
                tesseractOcr.SetVariable("tessedit_char_whitelist", "0123456789")
    
                ' for each image
                For Each image As Vintasoft.Imaging.VintasoftImage In images
                    System.Console.WriteLine("Recognize the image...")
    
                    ' recognize text in image
                    Dim ocrResult As Vintasoft.Imaging.Ocr.Results.OcrPage = tesseractOcr.Recognize(image)
    
                    ' output the recognized text
    
                    System.Console.WriteLine("Page Text:")
                    System.Console.WriteLine(ocrResult.GetText())
                    System.Console.WriteLine()
                Next
    
                ' shutdown the Tesseract OCR engine
                tesseractOcr.Shutdown()
            End Using
    
            ' free images
            images.ClearAndDisposeItems()
        End Using
    End Sub
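
The value passed to the "tessedit_char_whitelist" parameter in the example above is a plain string of allowed characters. When the allowed set is assembled from several groups (digits, a few letters, punctuation), a small helper can build that string without repeated characters. This is a minimal sketch; the class TesseractWhitelist and the method Build are hypothetical and not part of the SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not an SDK class): builds a character whitelist string.
public static class TesseractWhitelist
{
    // Concatenates the given character groups, keeping the first occurrence
    // of each character and dropping later duplicates.
    public static string Build(params string[] characterGroups)
    {
        HashSet<char> seen = new HashSet<char>();
        StringBuilder builder = new StringBuilder();

        foreach (string group in characterGroups)
        {
            foreach (char c in group)
            {
                // HashSet.Add returns false if the character is already present
                if (seen.Add(c))
                {
                    builder.Append(c);
                }
            }
        }
        return builder.ToString();
    }
}
```

The result, for instance TesseractWhitelist.Build("0123456789", ".,-"), can be passed as the second argument of TesseractOcr.SetVariable in the same way as the literal string in the example above.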