Run OCR only on non searchable pages

Post by **BenjaminA** » Wed Jul 24, 2019 3:39 pm

Hi!
We want to use the Vintasoft.Imaging.Ocr.Tesseract API.
We have PDFs that are already searchable, but in some cases some of the pages are images. How can we run OCR only on the pages that are not searchable?
Can you give us some sample code?
Thank you!

Post by **Alex** » Thu Jul 25, 2019 9:05 am

Hi Benjamin,

The PdfPage.IsImageOnly property lets you determine whether a page contains only a single image:
https://www.vintasoft.com/docs/vsimagin ... eOnly.html

Please check the PdfPage.IsImageOnly property of each page in the PDF document to find the image-only pages.

You can also check the PdfPage.TextRegion property:
https://www.vintasoft.com/docs/vsimagin ... egion.html

A page is "image-only" if the property returns an empty text region.
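
A minimal sketch of that check, using only the two properties mentioned above. The class and method names are illustrative, and the `using` directives assume that PdfDocument and PdfPage live in the Vintasoft.Imaging.Pdf and Vintasoft.Imaging.Pdf.Tree namespaces; adjust them to match your SDK version.

```csharp
using System.Collections.Generic;
using Vintasoft.Imaging.Pdf;       // assumed namespace of PdfDocument
using Vintasoft.Imaging.Pdf.Tree;  // assumed namespace of PdfPage

public static class NonSearchablePageFinder // hypothetical helper class
{
    // Returns the zero-based indexes of pages that look non-searchable.
    public static List<int> FindPagesWithoutText(string pdfFilename)
    {
        List<int> pageIndexes = new List<int>();

        // open the PDF document
        using (PdfDocument document = new PdfDocument(pdfFilename))
        {
            for (int i = 0; i < document.Pages.Count; i++)
            {
                PdfPage page = document.Pages[i];

                // page is an OCR candidate if it contains only one image
                // or if its text region is empty
                if (page.IsImageOnly || page.TextRegion.IsEmpty)
                    pageIndexes.Add(i);
            }
        }

        return pageIndexes;
    }
}
```

The returned indexes can then be used to decide which pages to pass to the OCR engine.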

Best regards, Alexander

Post by **BenjaminA** » Thu Jul 25, 2019 3:08 pm

I have this PDF: https://srv-file2.gofile.io/download/xcUz0M/1.pdf
The PDF contains only an image, but PdfPage.IsImageOnly = false.
Why is that?

Post by **BenjaminA** » Thu Jul 25, 2019 3:22 pm

I searched a lot but could not find any sample code showing how to create a new PDF and run OCR only on the non-searchable pages.
Can you please provide some sample code?
Thank you

Post by **Alex** » Thu Jul 25, 2019 5:43 pm

Hello Benjamin,

Here is code that shows how to recognize text in the image-only pages of a PDF document:

public static void RecognizeNonTextPdfPages(OcrEngine ocrEngine, string inPdfFilename, string outPdfFilename)
{
    OcrEngineManager engineManager = new OcrEngineManager(ocrEngine);
    OcrEngineSettings ocrSettings = new OcrEngineSettings(OcrLanguage.English);

    PdfRenderingSettings renderingSettings = new PdfRenderingSettings();
    renderingSettings.Resolution = new Resolution(300, 300);

    // open source PDF document
    using (PdfDocument document = new PdfDocument(inPdfFilename))
    {
        // create PDF document builder
        PdfDocumentBuilder builder = new PdfDocumentBuilder(document);
        builder.Font = PdfDocumentBuilder.CreateGlyphLessFont(document);
        builder.PageCreationMode = PdfPageCreationMode.ImageOverText;

        // for each page in source PDF document
        for (int i = 0; i < document.Pages.Count; i++)
        {
            // get PDF page
            PdfPage page = document.Pages[i];
            // if page does not have text
            if (page.TextRegion.IsEmpty)
            {
                // render image of PDF page
                using (VintasoftImage image = page.Render(renderingSettings, null, null))
                {
                    // recognize text in rendered image
                    OcrPage ocrPage = engineManager.Recognize(image, ocrSettings);
                    // if OCR recognized any text on the page
                    if (ocrPage != null && !string.IsNullOrEmpty(ocrPage.GetText()))
                        // set OCR page as a background for PDF page
                        builder.SetAsBackground(i, ocrPage);
                }
            }
        }

        // save PDF document to a new file
        document.Pack(outPdfFilename);
    }
}

Best regards, Alexander

Post by **BenjaminA** » Tue Jul 30, 2019 11:11 am

Hi!
Thank you!
Is it possible to remove the previous text layer from the PDF?

Post by **Alex** » Tue Jul 30, 2019 5:12 pm

Hello Benjamin,

BenjaminA wrote: Tue Jul 30, 2019 11:11 am Is it possible to remove the previous text layer from the PDF?

Yes, this is possible. Please use the PdfPage.RemoveText() method to remove text from a PDF page:
https://www.vintasoft.com/docs/vsimagin ... eText.html

Best regards, Alexander

Post by **BenjaminA** » Fri Aug 23, 2019 3:55 pm

Hi!
I used PdfPage.RemoveText() to remove the text. The problem is that the image is also removed for some PDFs.

We have two kinds of PDFs: ones that are already searchable, and ones that have been OCRed (image + text layer).

Post by **Alex** » Fri Aug 23, 2019 5:36 pm

Hi Benjamin,

BenjaminA wrote: Fri Aug 23, 2019 3:55 pm I used PdfPage.RemoveText() to remove the text. The problem is that the image is also removed for some PDFs.

We have two kinds of PDFs: ones that are already searchable, and ones that have been OCRed (image + text layer).

The PdfPage.RemoveText method removes only text and does not remove images.

To understand your problem, we need to reproduce it on our side. Please send us (to support@vintasoft.com) a small project that reproduces the problem.

Best regards, Alexander

Post by **BenjaminA** » Fri Aug 23, 2019 5:51 pm

Alex wrote: Fri Aug 23, 2019 5:36 pm The PdfPage.RemoveText method removes only text and does not remove images.

Yes, I think you are right.
In some cases we do not have a text + image layer; we have only a text layer.
When we run OCR on such PDFs, we end up with the text twice in the PDF.

So I must first convert the page to an image and then run OCR.

How can I convert a PDF text page to an image?
