Run OCR only on non searchable pages
Moderator: Alex
Hi!
We want to use the Vintasoft.Imaging.Ocr.Tesseract API.
We have PDFs that are already searchable, but in some of them certain pages are images. How can we run OCR only on the pages that are not searchable?
Can you give us some sample code?
Thank you!
Re: Run OCR only on non searchable pages
Hi Benjamin,
The PdfPage.IsImageOnly property indicates whether a page contains only a single image:
https://www.vintasoft.com/docs/vsimagin ... eOnly.html
Please check the PdfPage.IsImageOnly property for each page in the PDF document to find the image-only pages.
You can also check the PdfPage.TextRegion property:
https://www.vintasoft.com/docs/vsimagin ... egion.html
A page is "image-only" if the property returns an empty text region.
Best regards, Alexander
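The per-page check described above could look roughly like the following. This is only a minimal sketch: the class and method names (PdfTextCheck, FindPagesWithoutText) are hypothetical, the using directives are assumed namespaces, and only the PdfPage.IsImageOnly and PdfPage.TextRegion properties come from this thread.

```csharp
using System.Collections.Generic;
using Vintasoft.Imaging.Pdf;       // assumed namespace of PdfDocument
using Vintasoft.Imaging.Pdf.Tree;  // assumed namespace of PdfPage

public static class PdfTextCheck   // hypothetical helper class
{
    // Returns the zero-based indices of pages that have no text,
    // i.e. the candidates for OCR.
    public static List<int> FindPagesWithoutText(string pdfFilename)
    {
        List<int> pageIndices = new List<int>();
        using (PdfDocument document = new PdfDocument(pdfFilename))
        {
            for (int i = 0; i < document.Pages.Count; i++)
            {
                PdfPage page = document.Pages[i];
                // a page qualifies if it consists of a single image
                // or if its text region is empty
                if (page.IsImageOnly || page.TextRegion.IsEmpty)
                    pageIndices.Add(i);
            }
        }
        return pageIndices;
    }
}
```

The two conditions are combined with "or" so that a page is still picked up through its empty text region when IsImageOnly returns false.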
Re: Run OCR only on non searchable pages
I have this PDF: https://srv-file2.gofile.io/download/xcUz0M/1.pdf
The PDF contains only an image, but PdfPage.IsImageOnly = false.
Why is that?
Re: Run OCR only on non searchable pages
I searched a lot but could not find any sample code that shows how to create a new PDF and run OCR only on the non-searchable pages.
Can you please provide some sample code?
Thank you
Re: Run OCR only on non searchable pages
Hello Benjamin,
Here is code that shows how to recognize text in image-only pages of a PDF document:
public static void RecognizeNonTextPdfPages(OcrEngine ocrEngine, string inPdfFilename, string outPdfFilename)
{
    OcrEngineManager engineManager = new OcrEngineManager(ocrEngine);
    OcrEngineSettings ocrSettings = new OcrEngineSettings(OcrLanguage.English);
    PdfRenderingSettings renderingSettings = new PdfRenderingSettings();
    renderingSettings.Resolution = new Resolution(300, 300);
    // open source PDF document
    using (PdfDocument document = new PdfDocument(inPdfFilename))
    {
        // create PDF document builder
        PdfDocumentBuilder builder = new PdfDocumentBuilder(document);
        builder.Font = PdfDocumentBuilder.CreateGlyphLessFont(document);
        builder.PageCreationMode = PdfPageCreationMode.ImageOverText;
        // for each page in source PDF document
        for (int i = 0; i < document.Pages.Count; i++)
        {
            // get PDF page
            PdfPage page = document.Pages[i];
            // if page does not have text
            if (page.TextRegion.IsEmpty)
            {
                // render image of PDF page
                using (VintasoftImage image = page.Render(renderingSettings, null, null))
                {
                    // recognize text in rendered image
                    OcrPage ocrPage = engineManager.Recognize(image, ocrSettings);
                    // if OCR found text on the page
                    if (ocrPage != null && !string.IsNullOrEmpty(ocrPage.GetText()))
                    {
                        // set OCR page as a background for PDF page
                        builder.SetAsBackground(i, ocrPage);
                    }
                }
            }
        }
        // save PDF document to a new file
        document.Pack(outPdfFilename);
    }
}
Best regards, Alexander
Re: Run OCR only on non searchable pages
Hi!
Thank you!
Is it possible to remove the previous text layer from a PDF?
Re: Run OCR only on non searchable pages
Hello Benjamin,
Yes, this is possible. Please use the PdfPage.RemoveText() method to remove text from a PDF page:
https://www.vintasoft.com/docs/vsimagin ... eText.html
Best regards, Alexander
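As a rough illustration, PdfPage.RemoveText() could be applied to every page of a document as shown below. This is a sketch under assumptions: the class and method names (PdfTextRemoval, RemoveTextFromAllPages) are hypothetical, the using directives are assumed namespaces, and RemoveText() is called without arguments as it is written in this post.

```csharp
using Vintasoft.Imaging.Pdf;       // assumed namespace of PdfDocument
using Vintasoft.Imaging.Pdf.Tree;  // assumed namespace of PdfPage

public static class PdfTextRemoval // hypothetical helper class
{
    // Removes the text from every page and writes the result to a new file.
    public static void RemoveTextFromAllPages(string inPdfFilename, string outPdfFilename)
    {
        using (PdfDocument document = new PdfDocument(inPdfFilename))
        {
            for (int i = 0; i < document.Pages.Count; i++)
            {
                PdfPage page = document.Pages[i];
                // remove text content from the page
                page.RemoveText();
            }
            // save the modified document to a new file
            document.Pack(outPdfFilename);
        }
    }
}
```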
Re: Run OCR only on non searchable pages
Hi!
I used PdfPage.RemoveText() to remove the text. The problem is that the image is also removed for some PDFs.
We have two kinds of PDFs: ones that are already searchable, and ones that have been OCRed (image + text layer).
Re: Run OCR only on non searchable pages
Hi Benjamin,
The PdfPage.RemoveText method removes only text and does not remove images.
To understand your problem, we need to reproduce it on our side. Please send us (to support@vintasoft.com) a small project that reproduces the problem.
Best regards, Alexander
Re: Run OCR only on non searchable pages
Yes, I think you are right. The PdfPage.RemoveText method removes only text and does not remove images.
In some cases we do not have a text + image layer; we have only a text layer.
When we run OCR on such PDFs, we get two text layers in the PDF.
So I must first convert the page to an image and then run OCR.
How can I convert a PDF text page to an image?
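One possible approach, reusing the page.Render call from the sample posted earlier in this thread, is sketched below. It is not a confirmed answer from Vintasoft: the class and method names (PdfPageRasterizer, RenderPageToImage) are hypothetical, the using directives are assumed namespaces, and the two null arguments are simply copied from the earlier sample.

```csharp
using Vintasoft.Imaging;           // assumed namespace of VintasoftImage and Resolution
using Vintasoft.Imaging.Pdf;       // assumed namespace of PdfRenderingSettings
using Vintasoft.Imaging.Pdf.Tree;  // assumed namespace of PdfPage

public static class PdfPageRasterizer // hypothetical helper class
{
    // Renders a PDF page, including its text, to a raster image.
    // The caller is responsible for disposing the returned image.
    public static VintasoftImage RenderPageToImage(PdfPage page, int dpi)
    {
        PdfRenderingSettings renderingSettings = new PdfRenderingSettings();
        renderingSettings.Resolution = new Resolution(dpi, dpi);
        // same call as in the sample above
        return page.Render(renderingSettings, null, null);
    }
}
```

The returned image could then be passed to engineManager.Recognize(image, ocrSettings) as in the earlier sample. How the OCR result should replace the existing text layer of such a page is not covered in this thread.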