VintaSoft Imaging .NET SDK 10.0
    OCR: Recognize text in an image

    Text recognition using OcrEngine class (TesseractOcr)

    To recognize text in an image, call the OcrEngine.Recognize method. The method takes the image with text as input and returns the recognition results. The method performs the recognition in the working thread.

    Here is an example that shows how to recognize text in an image:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging
    ' - Vintasoft.Imaging.Ocr
    ' - Vintasoft.Imaging.Ocr.Tesseract
    
    Dim imageFilePath As String = "D:\TestImage.png"
    ' create the OCR engine
    Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
            ' specify that OCR engine will recognize German text
            Dim language As Vintasoft.Imaging.Ocr.OcrLanguage = Vintasoft.Imaging.Ocr.OcrLanguage.German
            ' create the OCR engine settings
            Dim settings As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language)
            ' initialize the OCR engine
            tesseractOcr.Init(settings)
    
            ' load an image with text
            Using image As New Vintasoft.Imaging.VintasoftImage(imageFilePath)
                    ' specify the image, where text must be recognized
                    tesseractOcr.SetImage(image)
    
                    ' recognize text in the image
                    Dim ocrResult As Vintasoft.Imaging.Ocr.Results.OcrPage = tesseractOcr.Recognize()
    
                    ' get the recognized text
                    Dim ocrResultAsText As String = ocrResult.GetText()
    
                    Dim textFilePath As String = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(imageFilePath), System.IO.Path.GetFileNameWithoutExtension(imageFilePath) & ".txt")
                    ' save the recognition results
                    System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8)
    
                    ' clear the image
                    tesseractOcr.ClearImage()
            End Using
            ' shutdown the OCR engine
            tesseractOcr.Shutdown()
    End Using
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging
    // - Vintasoft.Imaging.Ocr
    // - Vintasoft.Imaging.Ocr.Tesseract
    
    string imageFilePath = @"D:\TestImage.png";
    // create the OCR engine
    using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = 
        new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
    {
        // specify that OCR engine will recognize German text
        Vintasoft.Imaging.Ocr.OcrLanguage language = Vintasoft.Imaging.Ocr.OcrLanguage.German;
        // create the OCR engine settings
        Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings settings = 
            new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language);
        // initialize the OCR engine
        tesseractOcr.Init(settings);
    
        // load an image with text
        using (Vintasoft.Imaging.VintasoftImage image = new Vintasoft.Imaging.VintasoftImage(imageFilePath))
        {
            // specify the image, where text must be recognized
            tesseractOcr.SetImage(image);
    
            // recognize text in the image
            Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = tesseractOcr.Recognize();
    
            // get the recognized text
            string ocrResultAsText = ocrResult.GetText();
    
            string textFilePath = System.IO.Path.Combine(
                System.IO.Path.GetDirectoryName(imageFilePath),
                System.IO.Path.GetFileNameWithoutExtension(imageFilePath) + ".txt");
            // save the recognition results
            System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8);
    
            // clear the image
            tesseractOcr.ClearImage();
        }
        // shutdown the OCR engine
        tesseractOcr.Shutdown();
    }
    


    The OCR engine binarizes the image before text recognition if the OcrEngine.Binarization property specifies an image binarization command. Binarization is disabled if the OcrEngine.Binarization property is set to null. In most cases, binarizing the image before text recognition improves the recognition quality for document images. For images with mixed content, it is sometimes better not to binarize the image before running OCR.
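
    As a minimal sketch of the binarization note above, the fragment below turns binarization off before recognizing a mixed-content image. It reuses only calls shown in the first example. It assumes that the Binarization property can be assigned on the TesseractOcr instance and that assigning it after Init is acceptable; neither detail is confirmed by this topic. Like the other samples, it needs references to the Vintasoft.Imaging, Vintasoft.Imaging.Ocr and Vintasoft.Imaging.Ocr.Tesseract assemblies.

```csharp
string imageFilePath = @"D:\TestImage.png";
// create the OCR engine
using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr =
    new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
{
    // initialize the OCR engine for German text
    tesseractOcr.Init(new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(
        Vintasoft.Imaging.Ocr.OcrLanguage.German));

    // ASSUMPTION: the property is settable here; null disables binarization
    tesseractOcr.Binarization = null;

    // load an image with mixed content
    using (Vintasoft.Imaging.VintasoftImage image = new Vintasoft.Imaging.VintasoftImage(imageFilePath))
    {
        // recognize text in the image as-is, without binarization
        tesseractOcr.SetImage(image);
        Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = tesseractOcr.Recognize();
        System.Console.WriteLine(ocrResult.GetText());
        tesseractOcr.ClearImage();
    }
    // shutdown the OCR engine
    tesseractOcr.Shutdown();
}
```

    Comparing the output of this fragment with the output of the default configuration on the same image is a simple way to decide whether binarization helps for a given kind of content.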


    Text recognition using OcrEngineManager class

    The OcrEngine.Recognize method performs text recognition in the working thread. In practice, it may be necessary to run text recognition in several threads to speed up the process. Likewise, the OcrEngine.Recognize method recognizes text in a single language, while in practice it may be necessary to recognize text in several languages at once. The OcrEngineManager class is designed to solve these problems.
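
    The multi-language case can be sketched with per-region languages. The fragment below uses only types and constructors that appear elsewhere in this topic (OcrEngineManager, RecognitionRegion, RegionOfInterest, OcrLanguage). The region coordinates and the choice of English and German are hypothetical, and whether the manager handles mixed-language regions exactly this way is an assumption rather than something this topic states.

```csharp
string imageFilePath = @"D:\TestImage.png";
// create the main and the additional OCR engine
using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr mainEngine = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr additionalEngine = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
{
    // create the OCR engine manager
    Vintasoft.Imaging.Ocr.OcrEngineManager engineManager =
        new Vintasoft.Imaging.Ocr.OcrEngineManager(mainEngine, additionalEngine);

    using (Vintasoft.Imaging.VintasoftImage image = new Vintasoft.Imaging.VintasoftImage(imageFilePath))
    {
        Vintasoft.Imaging.Ocr.OcrEngineSettings settings = new Vintasoft.Imaging.Ocr.OcrEngineSettings();

        // hypothetical layout: English text at the top, German text below it
        Vintasoft.Imaging.Ocr.RecognitionRegion[] regions = new Vintasoft.Imaging.Ocr.RecognitionRegion[] {
            new Vintasoft.Imaging.Ocr.RecognitionRegion(
                new Vintasoft.Imaging.RegionOfInterest(0, 0, 319, 80), Vintasoft.Imaging.Ocr.OcrLanguage.English),
            new Vintasoft.Imaging.Ocr.RecognitionRegion(
                new Vintasoft.Imaging.RegionOfInterest(0, 330, 319, 85), Vintasoft.Imaging.Ocr.OcrLanguage.German) };

        // recognize text in both regions and print the combined result
        Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = engineManager.Recognize(image, settings, regions);
        System.Console.WriteLine(ocrResult.GetText());
    }
}
```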

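    The OcrEngineManager class allows you to recognize text in several threads and to recognize text in a very large image, as described in the sections below.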


    Text recognition in several threads

    By default, text recognition is performed in a single thread using only one instance of the OCR engine. Text can also be recognized in several threads using several instances of the OCR engine.

    Here is an example that shows how to recognize text in several regions of the same image in two threads:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging
    ' - Vintasoft.Imaging.Ocr
    ' - Vintasoft.Imaging.Ocr.Tesseract
    
    Dim imageFilePath As String = "D:\TestImage.png"
    ' create the main OCR engine
    Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
            ' create the additional OCR engine
            Using additionalEngine As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
                    ' create the OCR engine manager
                    Dim engineManager As New Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngine)
    
                    ' load an image with text
                    Using image As New Vintasoft.Imaging.VintasoftImage(imageFilePath)
                            ' create the OCR engine settings
                            Dim settings As New Vintasoft.Imaging.Ocr.OcrEngineSettings()
                            ' create the regions, where text must be searched
                            Dim regions As Vintasoft.Imaging.Ocr.RecognitionRegion() = New Vintasoft.Imaging.Ocr.RecognitionRegion() {New Vintasoft.Imaging.Ocr.RecognitionRegion(New Vintasoft.Imaging.RegionOfInterest(0, 0, 319, 80), Vintasoft.Imaging.Ocr.OcrLanguage.English), New Vintasoft.Imaging.Ocr.RecognitionRegion(New Vintasoft.Imaging.RegionOfInterest(0, 330, 319, 85), Vintasoft.Imaging.Ocr.OcrLanguage.English, 180)}
    
                            ' recognize text in the image regions
                            Dim ocrResult As Vintasoft.Imaging.Ocr.Results.OcrPage = engineManager.Recognize(image, settings, regions)
    
                            ' get the recognized text
                            Dim ocrResultAsText As String = ocrResult.GetText()
    
                            Dim textFilePath As String = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(imageFilePath), System.IO.Path.GetFileNameWithoutExtension(imageFilePath) & ".txt")
                            ' save the recognition results
                            System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8)
                    End Using
            End Using
    End Using
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging
    // - Vintasoft.Imaging.Ocr
    // - Vintasoft.Imaging.Ocr.Tesseract
    
    string imageFilePath = @"D:\TestImage.png";
    // create the main OCR engine
    using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
    {
        // create the additional OCR engine
        using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr additionalEngine = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
        {
            // create the OCR engine manager
            Vintasoft.Imaging.Ocr.OcrEngineManager engineManager = 
                new Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngine);
    
            // load an image with text
            using (Vintasoft.Imaging.VintasoftImage image = new Vintasoft.Imaging.VintasoftImage(imageFilePath))
            {
                // create the OCR engine settings
                Vintasoft.Imaging.Ocr.OcrEngineSettings settings = new Vintasoft.Imaging.Ocr.OcrEngineSettings();
                // create the regions, where text must be searched
                Vintasoft.Imaging.Ocr.RecognitionRegion[] regions = new Vintasoft.Imaging.Ocr.RecognitionRegion[] {
                    new Vintasoft.Imaging.Ocr.RecognitionRegion(
                        new Vintasoft.Imaging.RegionOfInterest(0, 0, 319, 80), Vintasoft.Imaging.Ocr.OcrLanguage.English),
                    new Vintasoft.Imaging.Ocr.RecognitionRegion(
                        new Vintasoft.Imaging.RegionOfInterest(0, 330, 319, 85), Vintasoft.Imaging.Ocr.OcrLanguage.English, 180) };
    
                // recognize text in the image regions
                Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = engineManager.Recognize(image, settings, regions);
    
                // get the recognized text
                string ocrResultAsText = ocrResult.GetText();
    
                string textFilePath = System.IO.Path.Combine(
                    System.IO.Path.GetDirectoryName(imageFilePath),
                    System.IO.Path.GetFileNameWithoutExtension(imageFilePath) + ".txt");
                // save the recognition results
                System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8);
            }
        }
    }
    


    Here is an example that shows how to recognize text in several images in three threads:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging
    ' - Vintasoft.Imaging.Ocr
    ' - Vintasoft.Imaging.Ocr.Tesseract
    
    Dim imageFilePath As String = "D:\TestImage.png"
    ' create the OCR engine
    Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
            ' create an array for additional OCR engines
            Dim additionalEngines As Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr() = New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(1) {}
            Try
                    ' for each additional OCR engine
                    For i As Integer = 0 To additionalEngines.Length - 1
                            ' create the additional OCR engine
                            additionalEngines(i) = New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
                    Next
    
                    ' create the OCR engine manager
                    Dim engineManager As New Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngines)
                    ' create an image collection
                    Using images As New Vintasoft.Imaging.ImageCollection()
                            ' load the images from file
                            images.Add(imageFilePath)
    
                            ' create the OCR engine settings
                            Dim settings As New Vintasoft.Imaging.Ocr.OcrEngineSettings()
    
                            ' recognize text in images
                            Dim ocrResults As Vintasoft.Imaging.Ocr.Results.OcrDocument = engineManager.Recognize(images, settings)
    
                            ' create string with the recognition results
    
                            Dim documentContent As New System.Text.StringBuilder()
                            ' for each recognized page
                            For i As Integer = 0 To ocrResults.Pages.Count - 1
                                    documentContent.AppendFormat("Page {0}", i + 1)
                                    documentContent.AppendLine()
                                    documentContent.AppendLine()
    
                                    documentContent.AppendLine(ocrResults.Pages(i).GetText())
    
                                    If i <> ocrResults.Pages.Count - 1 Then
                                            documentContent.AppendLine()
                                            documentContent.AppendLine()
                                            documentContent.AppendLine()
                                    End If
                            Next
    
                            Dim textFilePath As String = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(imageFilePath), System.IO.Path.GetFileNameWithoutExtension(imageFilePath) & ".txt")
                            ' save the recognition results
                            System.IO.File.WriteAllText(textFilePath, documentContent.ToString(), System.Text.Encoding.UTF8)
    
                            ' clear and dispose images
                            images.ClearAndDisposeItems()
                    End Using
            Finally
                    ' for each additional OCR engine
                    For i As Integer = 0 To additionalEngines.Length - 1
                            If additionalEngines(i) IsNot Nothing Then
                                    ' dispose the additional OCR engine
                                    additionalEngines(i).Dispose()
                            End If
                    Next
            End Try
    End Using
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging
    // - Vintasoft.Imaging.Ocr
    // - Vintasoft.Imaging.Ocr.Tesseract
    
    string imageFilePath = @"D:\TestImage.png";
    // create the OCR engine
    using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = 
        new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
    {
        // create an array for additional OCR engines
        Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr[] additionalEngines = 
            new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr[2];
        try
        {
            // for each additional OCR engine
            for (int i = 0; i < additionalEngines.Length; i++)
                // create the additional OCR engine
                additionalEngines[i] = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr();
    
            // create the OCR engine manager
            Vintasoft.Imaging.Ocr.OcrEngineManager engineManager = 
                new Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngines);
            // create an image collection
            using (Vintasoft.Imaging.ImageCollection images = 
                new Vintasoft.Imaging.ImageCollection())
            {
                // load the images from file
                images.Add(imageFilePath);
    
                // create the OCR engine settings
                Vintasoft.Imaging.Ocr.OcrEngineSettings settings = new Vintasoft.Imaging.Ocr.OcrEngineSettings();
    
                // recognize text in images
                Vintasoft.Imaging.Ocr.Results.OcrDocument ocrResults = engineManager.Recognize(images, settings);
    
                // create string with the recognition results
    
                System.Text.StringBuilder documentContent = new System.Text.StringBuilder();
                // for each recognized page
                for (int i = 0; i < ocrResults.Pages.Count; i++)
                {
                    documentContent.AppendFormat("Page {0}", i + 1);
                    documentContent.AppendLine();
                    documentContent.AppendLine();
    
                    documentContent.AppendLine(ocrResults.Pages[i].GetText());
    
                    if (i != ocrResults.Pages.Count - 1)
                    {
                        documentContent.AppendLine();
                        documentContent.AppendLine();
                        documentContent.AppendLine();
                    }
                }
    
                string textFilePath = System.IO.Path.Combine(
                    System.IO.Path.GetDirectoryName(imageFilePath),
                    System.IO.Path.GetFileNameWithoutExtension(imageFilePath) + ".txt");
                // save the recognition results
                System.IO.File.WriteAllText(textFilePath, documentContent.ToString(), System.Text.Encoding.UTF8);
    
                // clear and dispose images
                images.ClearAndDisposeItems();
            }
        }
        finally
        {
            // for each additional OCR engine
            for (int i = 0; i < additionalEngines.Length; i++)
            {
                if (additionalEngines[i] != null)
                    // dispose the additional OCR engine
                    additionalEngines[i].Dispose();
            }
        }
    }
    


    Text recognition in a very large image

    Tesseract OCR creates a copy of the image data before text recognition. For a very large image, this may lead to very high memory usage or even an out-of-memory error. The OcrEngineManager class can divide a very large image into several smaller images, recognize text in each of them, and merge the OCR results into a single result.
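
    To give a rough sense of the memory involved, the sketch below estimates the size of uncompressed pixel data for a 20000x30000, 24-bpp image and for a 5000x5000 piece of it. ImageMemoryEstimate and RawPixelBytes are hypothetical helper names and are not part of the SDK. The estimate ignores row padding and any internal format the OCR engine may use for its copy, so treat the numbers as an order of magnitude only.

```csharp
using System;

public static class ImageMemoryEstimate
{
    // Size in bytes of uncompressed pixel data, ignoring row padding.
    public static long RawPixelBytes(int width, int height, int bitsPerPixel)
    {
        // use 64-bit arithmetic: the totals do not fit into Int32
        long bitsPerRow = (long)width * bitsPerPixel;
        long bytesPerRow = (bitsPerRow + 7) / 8;
        return bytesPerRow * height;
    }

    public static void Main()
    {
        // whole image: 60000 bytes per row * 30000 rows
        Console.WriteLine(RawPixelBytes(20000, 30000, 24)); // prints 1800000000
        // one 5000x5000 piece: 15000 bytes per row * 5000 rows
        Console.WriteLine(RawPixelBytes(5000, 5000, 24));   // prints 75000000
    }
}
```

    Under these assumptions a single copy of the whole image takes about 1.8 GB, while a copy of one 5000x5000 piece takes about 75 MB, which is why recognizing the image piece by piece keeps memory usage manageable.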

    Here is an example that shows how to recognize text in a very large image (20000x30000, 24-bpp, 600 dpi):
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging
    ' - Vintasoft.Imaging.Ocr
    ' - Vintasoft.Imaging.Ocr.Tesseract
    
    Dim imageFilePath As String = "D:\TestImage.png"
    ' create the OCR engine
    Using tesseractOcr As New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
            ' create an array for additional OCR engines
            Dim additionalEngines As Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr() = New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(2) {}
            Try
                    ' for each additional OCR engine
                    For i As Integer = 0 To additionalEngines.Length - 1
                            ' create the additional OCR engine
                            additionalEngines(i) = New Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr()
                    Next
    
                    ' create the OCR engine manager
                    Dim engineManager As New Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngines)
    
                    ' load an image from file
                    Using image As New Vintasoft.Imaging.VintasoftImage(imageFilePath)
                            ' create the recognition region splitting settings
                            Dim splittingSettings As New Vintasoft.Imaging.Ocr.OcrRecognitionRegionSplittingSettings(5000, 5000, 2000)
                            ' specify that the recognition regions must be split into sub regions
                            engineManager.RecognitionRegionSplittingSettings = splittingSettings
    
                            ' create the OCR engine settings
                            Dim settings As New Vintasoft.Imaging.Ocr.OcrEngineSettings()
    
                            ' recognize text in image
                            Dim ocrResult As Vintasoft.Imaging.Ocr.Results.OcrPage = engineManager.Recognize(image, settings)
    
                            ' get the recognized text
                            Dim ocrResultAsText As String = ocrResult.GetText()
    
                            Dim textFilePath As String = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(imageFilePath), System.IO.Path.GetFileNameWithoutExtension(imageFilePath) & ".txt")
                            ' save the recognized text to a file
                            System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8)
                    End Using
            Finally
                    ' for each additional OCR engine
                    For i As Integer = 0 To additionalEngines.Length - 1
                            If additionalEngines(i) IsNot Nothing Then
                                    ' dispose the additional OCR engine
                                    additionalEngines(i).Dispose()
                            End If
                    Next
            End Try
    End Using
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging
    // - Vintasoft.Imaging.Ocr
    // - Vintasoft.Imaging.Ocr.Tesseract
    
    string imageFilePath = @"D:\TestImage.png";
    // create the OCR engine
    using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr())
    {
        // create an array for additional OCR engines
        Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr[] additionalEngines = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr[3];
        try
        {
            // for each additional OCR engine
            for (int i = 0; i < additionalEngines.Length; i++)
                // create the additional OCR engine
                additionalEngines[i] = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr();
    
            // create the OCR engine manager
            Vintasoft.Imaging.Ocr.OcrEngineManager engineManager = new Vintasoft.Imaging.Ocr.OcrEngineManager(tesseractOcr, additionalEngines);
    
            // load an image from file
            using (Vintasoft.Imaging.VintasoftImage image = new Vintasoft.Imaging.VintasoftImage(imageFilePath))
            {
                // create the recognition region splitting settings
                Vintasoft.Imaging.Ocr.OcrRecognitionRegionSplittingSettings splittingSettings =
                    new Vintasoft.Imaging.Ocr.OcrRecognitionRegionSplittingSettings(5000, 5000, 2000);
                // specify that the recognition regions must be split into sub regions
                engineManager.RecognitionRegionSplittingSettings = splittingSettings;
    
                // create the OCR engine settings
                Vintasoft.Imaging.Ocr.OcrEngineSettings settings = new Vintasoft.Imaging.Ocr.OcrEngineSettings();
    
                // recognize text in image
                Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = engineManager.Recognize(image, settings);
    
                // get the recognized text
                string ocrResultAsText = ocrResult.GetText();
    
                string textFilePath = System.IO.Path.Combine(
                    System.IO.Path.GetDirectoryName(imageFilePath),
                    System.IO.Path.GetFileNameWithoutExtension(imageFilePath) + ".txt");
                // save the recognized text to a file
                System.IO.File.WriteAllText(textFilePath, ocrResultAsText, System.Text.Encoding.UTF8);
            }
        }
        finally
        {
            // for each additional OCR engine
            for (int i = 0; i < additionalEngines.Length; i++)
            {
                if (additionalEngines[i] != null)
                    // dispose the additional OCR engine
                    additionalEngines[i].Dispose();
            }
        }
    }