Run OCR only on non searchable pages

Questions, comments and suggestions concerning VintaSoft Imaging .NET SDK.

Moderator: Alex

BenjaminA
Posts: 11
Joined: Wed Jul 24, 2019 3:33 pm

Post by BenjaminA »

Hi!
We want to use the Vintasoft.Imaging.Ocr.Tesseract API.
We have PDFs that are already searchable, but in some of them certain pages are images only. How can we run OCR only on the pages that are not searchable?
Could you give us some sample code?
Thank you!
Alex
Site Admin
Posts: 2303
Joined: Thu Jul 10, 2008 2:21 pm

Post by Alex »

Hi Benjamin,

The PdfPage.IsImageOnly property indicates whether a page contains only a single image:
https://www.vintasoft.com/docs/vsimagin ... eOnly.html

Check the PdfPage.IsImageOnly property for each page in the PDF document to find the image-only pages.

You can also check the PdfPage.TextRegion property:
https://www.vintasoft.com/docs/vsimagin ... egion.html

A page is "image-only" if the property returns an empty text region.
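
A minimal sketch of that check, for illustration only. It assumes that PdfDocument lives in the Vintasoft.Imaging.Pdf namespace, that PdfPage lives in Vintasoft.Imaging.Pdf.Tree, and that the returned text region exposes an IsEmpty property. The class and method names are hypothetical, and the code has not been verified against a particular SDK version:

```csharp
using System.Collections.Generic;
using Vintasoft.Imaging.Pdf;       // assumed namespace of PdfDocument
using Vintasoft.Imaging.Pdf.Tree;  // assumed namespace of PdfPage

public static class PdfPageChecks
{
    // Hypothetical helper: returns the zero-based indexes of pages
    // that look like OCR candidates (image-only or without text).
    public static List<int> FindPagesWithoutText(string pdfFilename)
    {
        List<int> pageIndexes = new List<int>();

        using (PdfDocument document = new PdfDocument(pdfFilename))
        {
            for (int i = 0; i < document.Pages.Count; i++)
            {
                PdfPage page = document.Pages[i];

                // a page is a candidate if it is reported as image-only
                // or if its text region is empty
                if (page.IsImageOnly || page.TextRegion.IsEmpty)
                    pageIndexes.Add(i);
            }
        }

        return pageIndexes;
    }
}
```

Checking both properties is a deliberate choice in this sketch: a page that holds an image plus other non-text content may not be reported as image-only, while its text region is still empty.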

Best regards, Alexander
BenjaminA
Post by BenjaminA »

I have this PDF: https://srv-file2.gofile.io/download/xcUz0M/1.pdf
The PDF contains only an image, but PdfPage.IsImageOnly = false.
Why is that?
BenjaminA
Post by BenjaminA »

I searched a lot but could not find any sample code that shows how to create a new PDF and run OCR only on the non-searchable pages.
Could you please provide some sample code?
Thank you
Alex
Post by Alex »

Hello Benjamin,

Here is code that shows how to recognize text in the image-only pages of a PDF document:

public static void RecognizeNonTextPdfPages(OcrEngine ocrEngine, string inPdfFilename, string outPdfFilename)
{
    OcrEngineManager engineManager = new OcrEngineManager(ocrEngine);
    OcrEngineSettings ocrSettings = new OcrEngineSettings(OcrLanguage.English);

    PdfRenderingSettings renderingSettings = new PdfRenderingSettings();
    renderingSettings.Resolution = new Resolution(300, 300);

    // open source PDF document
    using (PdfDocument document = new PdfDocument(inPdfFilename))
    {
        // create PDF document builder
        PdfDocumentBuilder builder = new PdfDocumentBuilder(document);
        builder.Font = PdfDocumentBuilder.CreateGlyphLessFont(document);
        builder.PageCreationMode = PdfPageCreationMode.ImageOverText;

        // for each page in source PDF document
        for (int i = 0; i < document.Pages.Count; i++)
        {
            // get PDF page
            PdfPage page = document.Pages[i];
            // if page does not have text
            if (page.TextRegion.IsEmpty)
            {
                // render image of PDF page
                using (VintasoftImage image = page.Render(renderingSettings, null, null))
                {
                    // recognize text in rendered image
                    OcrPage ocrPage = engineManager.Recognize(image, ocrSettings);
                    // if OCR recognized any text in the rendered image
                    if (ocrPage != null && !string.IsNullOrEmpty(ocrPage.GetText()))
                    {
                        // set OCR page as a background for PDF page
                        builder.SetAsBackground(i, ocrPage);
                    }
                }
            }
        }

        // save PDF document to a new file
        document.Pack(outPdfFilename);
    }
}
Best regards, Alexander
BenjaminA
Post by BenjaminA »

Hi!
Thank you!
Is it possible to remove the previous text layer from a PDF?
Alex
Post by Alex »

Hello Benjamin,
BenjaminA wrote: Tue Jul 30, 2019 11:11 am Is it possible to remove the previous text layer from a PDF?

Yes, this is possible. Use the PdfPage.RemoveText() method to remove text from a PDF page:
https://www.vintasoft.com/docs/vsimagin ... eText.html
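
As a rough sketch, removing the text from every page and saving the result to a new file might look like the code below. The namespaces in the using directives and the class and method names are assumptions for illustration; PdfPage.RemoveText() and PdfDocument.Pack() are called without arguments other than the output file name, and the sketch has not been verified against a particular SDK version:

```csharp
using Vintasoft.Imaging.Pdf;       // assumed namespace of PdfDocument
using Vintasoft.Imaging.Pdf.Tree;  // assumed namespace of PdfPage

public static class PdfTextRemoval
{
    // Hypothetical helper: removes text from all pages of the source PDF
    // and writes the result to a new file.
    public static void RemoveTextFromAllPages(string inPdfFilename, string outPdfFilename)
    {
        // open source PDF document
        using (PdfDocument document = new PdfDocument(inPdfFilename))
        {
            // for each page in source PDF document
            for (int i = 0; i < document.Pages.Count; i++)
            {
                PdfPage page = document.Pages[i];
                // remove text from the page
                page.RemoveText();
            }

            // save PDF document to a new file
            document.Pack(outPdfFilename);
        }
    }
}
```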

Best regards, Alexander
BenjaminA
Post by BenjaminA »

Hi!
I used PdfPage.RemoveText() to remove the text. The problem is that the image is also removed for some PDFs.

We have two kinds of PDFs: ones that are already searchable, and ones that have been OCRed (image + text layer).
Alex
Post by Alex »

Hi Benjamin,
BenjaminA wrote: Fri Aug 23, 2019 3:55 pm I used PdfPage.RemoveText() to remove the text. The problem is that the image is also removed for some PDFs.

We have two kinds of PDFs: ones that are already searchable, and ones that have been OCRed (image + text layer).

The PdfPage.RemoveText method removes only text and does not remove images.

To understand the problem, we need to reproduce it on our side. Please send us (at support@vintasoft.com) a small project that reproduces the problem.

Best regards, Alexander
BenjaminA
Post by BenjaminA »

Alex wrote: The PdfPage.RemoveText method removes only text and does not remove images.
Yes, I think you are right.
In some cases the page does not have an image plus a text layer; it has only a text layer.
When we run OCR on such PDFs, the text appears twice in the PDF.

So I must first convert the page to an image and then run OCR.

How can I convert a PDF text page to an image?