Recognize text in an image using a .NET application on Linux

Blog category: Imaging, OCR, .NET, Linux

December 23, 2022

This article explains how to create a console .NET application that recognizes text in images on Ubuntu. Text recognition is performed with VintaSoft Imaging .NET SDK and its PDF, OCR and Document Cleanup Plug-ins.

Here are the steps to complete that task:
  1. Open Ubuntu desktop.

  2. Create a folder to store the files of the .NET application. In this example, create the "Recognize_Text_In_Image" folder on the current user's desktop and open it.


  3. Open a terminal in that folder. To do this, choose the "Open in Terminal" item in the folder's context menu, or press Ctrl+Alt+T and change to the folder with the cd command.


  4. Run the following command in the terminal to create a new console .NET application project:
    dotnet new console --framework net6.0
    



    The created project contains the project file "Recognize_Text_In_Image.csproj" and the "Program.cs" file, which contains the C# code of the application. Close the terminal.
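
Steps 2-4 can also be done entirely from the terminal. The following is a minimal sketch, assuming the .NET 6 SDK is installed, "dotnet" is on PATH, and the desktop folder is "$HOME/Desktop" (it may be named differently on localized systems). The names PROJECT_DIR and create_project are illustrative and not part of any SDK.

```shell
# Sketch of steps 2-4 as a reusable shell helper (bash).
# Assumption: the desktop folder is $HOME/Desktop; override PROJECT_DIR if not.
PROJECT_DIR="${PROJECT_DIR:-$HOME/Desktop/Recognize_Text_In_Image}"

create_project() {
    # Create the project folder (no error if it already exists).
    mkdir -p "$PROJECT_DIR" || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    # "dotnet new console" names the project after the folder by default.
    ( cd "$PROJECT_DIR" && dotnet new console --framework net6.0 )
}
```

Calling create_project once should leave "Recognize_Text_In_Image.csproj" and "Program.cs" in the project folder, as in step 4.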

  5. Open the project file "Recognize_Text_In_Image.csproj" in a text editor and replace its contents with the following:
    <Project Sdk="Microsoft.NET.Sdk">
    
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net6.0</TargetFramework>
        <RootNamespace>ConsoleApp1</RootNamespace>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
      </PropertyGroup>
    
      <ItemGroup>
        <PackageReference Include="SkiaSharp" Version="2.88.0" />
        <PackageReference Include="SkiaSharp.NativeAssets.Linux" Version="2.88.0" />
        <PackageReference Include="Vintasoft.Imaging" Version="12.1.5.1" />
        <PackageReference Include="Vintasoft.Imaging.Drawing.SkiaSharp" Version="12.1.5.1" />
        <PackageReference Include="Vintasoft.Imaging.DocCleanup" Version="7.1.5.1" />
        <PackageReference Include="Vintasoft.Imaging.Ocr" Version="7.1.5.1" />
        <PackageReference Include="Vintasoft.Imaging.Ocr.Tesseract" Version="7.1.5.1" />
        <PackageReference Include="Vintasoft.Imaging.Pdf" Version="9.1.5.1" />
        <PackageReference Include="Vintasoft.Imaging.Pdf.Ocr" Version="9.1.5.1" />
        <PackageReference Include="Vintasoft.Shared" Version="3.3.1.1" />
      </ItemGroup>
    
      <ItemGroup>
        <Content Include="OCR.tif">
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
        </Content>
      </ItemGroup>
    
    </Project>
    



    The updated project references the NuGet packages for VintaSoft Imaging .NET SDK (Vintasoft.Shared.dll, Vintasoft.Imaging.dll, Vintasoft.Imaging.Drawing.SkiaSharp.dll), VintaSoft Document Cleanup .NET Plug-in (Vintasoft.Imaging.DocCleanup.dll), VintaSoft OCR .NET Plug-in (Vintasoft.Imaging.Ocr.dll, Vintasoft.Imaging.Ocr.Tesseract.dll) and VintaSoft PDF .NET Plug-in (Vintasoft.Imaging.Pdf.dll, Vintasoft.Imaging.Pdf.Ocr.dll). It also copies "OCR.tif" to the output directory on every build.
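
As an alternative to typing the PackageReference entries by hand, the same references can likely be added with the "dotnet add package" command. This is a sketch only: it assumes the current directory contains "Recognize_Text_In_Image.csproj" and that the packages are available from the configured NuGet source under the IDs and versions listed in the project file. The helper name add_packages is illustrative.

```shell
# Sketch: add the package references from step 5 with the dotnet CLI (bash).
# Assumption: run from the folder that contains the .csproj file.
add_packages() {
    # Each line of the list below is "<package id> <version>".
    while read -r name version; do
        # </dev/null keeps dotnet from consuming the rest of the list on stdin.
        dotnet add package "$name" --version "$version" </dev/null || return 1
    done <<'EOF'
SkiaSharp 2.88.0
SkiaSharp.NativeAssets.Linux 2.88.0
Vintasoft.Imaging 12.1.5.1
Vintasoft.Imaging.Drawing.SkiaSharp 12.1.5.1
Vintasoft.Imaging.DocCleanup 7.1.5.1
Vintasoft.Imaging.Ocr 7.1.5.1
Vintasoft.Imaging.Ocr.Tesseract 7.1.5.1
Vintasoft.Imaging.Pdf 9.1.5.1
Vintasoft.Imaging.Pdf.Ocr 9.1.5.1
Vintasoft.Shared 3.3.1.1
EOF
}
```

This only covers the package references. The Content item that copies "OCR.tif" to the output directory still has to be added to the project file in a text editor.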

  6. Open the "Program.cs" file and replace its contents with the following C# code:
    namespace ConsoleApp1
    {
        class Program
        {
            static void Main(string[] args)
            {
                // register the evaluation license (the placeholder values are obtained in step 7)
                Vintasoft.Imaging.ImagingGlobalSettings.Register("%EVAL_LIC_USER_NAME%", "%EVAL_LIC_USER_EMAIL%", "%EVAL_LIC_DATE%", "%EVAL_LIC_REG_CODE%");
    
                string imageFilePath = "OCR.tif";
    
                string tesseractOcrPath = "TesseractOCR";
                // create the OCR engine
                using (Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr tesseractOcr = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcr(tesseractOcrPath))
                {
                    // specify that OCR engine will recognize English text
                    Vintasoft.Imaging.Ocr.OcrLanguage language = Vintasoft.Imaging.Ocr.OcrLanguage.English;
                    // create the OCR engine settings
                    Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings settings = new Vintasoft.Imaging.Ocr.Tesseract.TesseractOcrSettings(language);
                    // initialize the OCR engine
                    tesseractOcr.Init(settings);
    
                    // load an image with text
                    using (Vintasoft.Imaging.VintasoftImage image = new Vintasoft.Imaging.VintasoftImage(imageFilePath))
                    {
                        // preprocess image before text recognition
    
                        // remove noise from image
                        Vintasoft.Imaging.ImageProcessing.Document.DespeckleCommand despeckleCommand = new Vintasoft.Imaging.ImageProcessing.Document.DespeckleCommand();
                        despeckleCommand.ExecuteInPlace(image);
                        // remove lines from image
                        Vintasoft.Imaging.ImageProcessing.Document.LineRemovalCommand lineRemovalCommand = new Vintasoft.Imaging.ImageProcessing.Document.LineRemovalCommand();
                        lineRemovalCommand.ExecuteInPlace(image);
    
                        // specify an image with text
                        tesseractOcr.SetImage(image);
    
                        // recognize text in image
                        Vintasoft.Imaging.Ocr.Results.OcrPage ocrResult = tesseractOcr.Recognize();
    
                        // create PDF document
                        using (Vintasoft.Imaging.Pdf.PdfDocument pdfDocument = new Vintasoft.Imaging.Pdf.PdfDocument("OCR.pdf", Vintasoft.Imaging.Pdf.PdfFormat.Pdf_14))
                        {
                            // create PDF document builder
                            Vintasoft.Imaging.Pdf.Ocr.PdfDocumentBuilder documentBuilder = new Vintasoft.Imaging.Pdf.Ocr.PdfDocumentBuilder(pdfDocument);
                            documentBuilder.ImageCompression = Vintasoft.Imaging.Pdf.PdfCompression.Auto;
                            documentBuilder.PageCreationMode = Vintasoft.Imaging.Pdf.Ocr.PdfPageCreationMode.ImageOverText;
    
                            // add OCR result to the PDF document
                            documentBuilder.AddPage(image, ocrResult);
    
                            // save changes in PDF document
                            pdfDocument.SaveChanges();
                        }
    
                        // clear the image
                        tesseractOcr.ClearImage();
                    }
                    // shutdown the OCR engine
                    tesseractOcr.Shutdown();
                }
            }
        }
    }
    



    This code removes noise and lines from the image, recognizes the text in it, and saves the result to a searchable PDF document ("OCR.pdf").

  7. Get the code for using the evaluation version on Linux as described in the documentation, and insert it into the C# code of the "Program.cs" file (the ImagingGlobalSettings.Register call at the start of Main contains placeholders for it).


  8. Copy the "OCR.tif" file to the project folder.


    You can use any other document image file instead of "OCR.tif". In that case, update the file name in both the project file and "Program.cs".

  9. Open the terminal in the project folder and build the .NET project using the following command:
    dotnet build Recognize_Text_In_Image.csproj
    



    Close the terminal.

  10. Go to the "bin/Debug/net6.0/" folder inside the project folder.


  11. Open the terminal in that folder and run the .NET application using the following command:
    dotnet ./Recognize_Text_In_Image.dll
    



    Close the terminal.
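
Steps 8-11 can be combined into a single terminal helper. The sketch below assumes the project was created in "$HOME/Desktop/Recognize_Text_In_Image", that the project file and "Program.cs" were already edited as in steps 5-7, and that "dotnet" is on PATH. The names PROJECT_DIR and build_and_run are illustrative.

```shell
# Sketch of steps 8-11 as one shell helper (bash).
# Assumption: the project lives in $HOME/Desktop/Recognize_Text_In_Image.
PROJECT_DIR="${PROJECT_DIR:-$HOME/Desktop/Recognize_Text_In_Image}"

# $1 is the path to the document image to recognize.
build_and_run() {
    local image="$1"
    if [ ! -f "$image" ]; then
        echo "Input image not found: $image" >&2
        return 1
    fi
    # Step 8: copy the image into the project folder as OCR.tif
    # (skipped when it is already that file).
    [ "$image" -ef "$PROJECT_DIR/OCR.tif" ] ||
        cp "$image" "$PROJECT_DIR/OCR.tif" || return 1
    # Steps 9-11: build, enter the output folder, run the application.
    # The subshell keeps the caller's working directory unchanged.
    (
        cd "$PROJECT_DIR" &&
        dotnet build Recognize_Text_In_Image.csproj &&
        cd bin/Debug/net6.0 &&
        dotnet ./Recognize_Text_In_Image.dll
    )
}
```

For example, build_and_run ~/Downloads/OCR.tif would copy that image (the path here is hypothetical), build the project, and run it from the output folder. Because "Program.cs" writes "OCR.pdf" using a relative path, the PDF should appear in "bin/Debug/net6.0/".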

  12. Open the created PDF document ("OCR.pdf") to see the results.