Saving Multipage Tiff and OCR

Posted: Wed Apr 24, 2013 3:48 pm
by SebastianB
Hey Alex,

maybe I am stupid or just blind, hopefully you can help.

I loaded a multipage TIFF into an ImageCollection. On each page / image of the collection I run OCR and split the document based on the OCR results. For example, if OCR succeeds on page 2, pages 0 and 1 are put into another ImageCollection and saved. Pages from 2 up to the page before the next match are saved into another file, and so on.

However I have the following issue:
When I run OCR on the original document, I get a value. But when I run OCR on the files produced by saving the other ImageCollection(s), I do not get any results. My assumption is that there might be a difference between the original TIFF and the new TIFF, but I cannot see it. The RenderingSettings of both ImageCollections are the same. I applied the following method to each (filled) ImageCollection.

Code:

internal static void SetRenderingSettings(ref ImageCollection imageCollection)
{
    if (imageCollection == null)
        return;

    imageCollection.SetRenderingSettings(new RenderingSettings(300, 300, InterpolationMode.Default, SmoothingMode.Default));
}


Is there a way to save the new files with exactly the same settings, format definition etc. the original ImageCollection has?


(In this case I am not allowed to send sample files due to data protection policy and customer security policy.)

Best,
Sebastian

Re: Saving Multipage Tiff

Posted: Thu Apr 25, 2013 8:08 am
by Yuri
Hello Sebastian,

Your code where you set rendering settings has nothing to do with raster TIFF format. It is for vector format rendering, like PDF.

For TIFF the most important things are compression and resolution. By default, the TIFF encoder uses a lossless encoding mode and does not change the image itself.

Maybe you used some image pre- or post-processing before or after OCR?

If so, you can do the following: clone the source image, perform the image processing and OCR on the cloned image, and save the source image.


Sincerely, Yuri

Re: Saving Multipage Tiff

Posted: Thu Apr 25, 2013 10:58 am
by SebastianB
I double checked some things:

After loading the tiff file into the ImageCollection (sourceCollection), each image has a resolution of 300dpi. So I just skipped over the SetRenderingSettings method above.
I perform OCR directly on the images in the ImageCollection. No pre- or post-processing is performed. (I guess you are talking about things like the DocCleanUp.)

OCR is done in a specified region. Most times the region exactly fits the area where the words are located, sometimes the word on the document is a little bit outside of this region. The result is: exact recognition or weird results (but results).

Now I saved my (for example) 10-page multipage TIFF into several new multipage TIFFs. As explained in my previous post, this is done by checking whether the OCR returned a result or not. (There are also more features, like checking for exact text or contained text.) If it did, this page becomes the first page of the new document, and the following pages are added to this new document until a result is recognized again in the given region. The splitting is done by just adding the image from sourceCollection to a new ImageCollection (targetCollection). targetCollection is stored using the following code.

A small snippet showing how the source image becomes a target image:

Code:

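for (int page = 0; page < sourceFileImages.Count; page++)
{
    // Add the current source page to the document being assembled.
    targetFileImages.Add(sourceFileImages[page]);

    // Save when the next page starts a new document or this is the last page.
    if (pagesToSplit.Contains(page + 1) || page == sourceFileImages.Count - 1)
    {
        targetFileImages.SaveSync(targetFileName, true);
        // Empty the target collection before collecting the next document.
        targetFileImages.ClearAndDisposeItems();
    }
}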

Now there are, for example, 4 new multipage TIFF files.
I run the application again, but now it recognizes nothing at all, and I am wondering why. I expected to get the same recognition results. Apart from there being more files (or rather, fewer pages in each file), nothing else has changed.
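
To make the split boundaries easier to check independently of the imaging library, the grouping rule from the loop above can be sketched as a plain function. This is only a sketch: `PageSplitter` and `ComputeRanges` are hypothetical names, and it assumes that `pagesToSplit` holds the zero-based indexes of the pages that start a new document.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of any imaging SDK.
public static class PageSplitter
{
    // Returns inclusive, zero-based (First, Last) page ranges, one per output file.
    // Assumption: pagesToSplit contains the indexes of pages that START a new document.
    public static List<(int First, int Last)> ComputeRanges(
        int pageCount, ICollection<int> pagesToSplit)
    {
        var ranges = new List<(int First, int Last)>();
        int first = 0;

        for (int page = 0; page < pageCount; page++)
        {
            bool isLastPage = page == pageCount - 1;

            // Same condition as in the loop above: close the current document
            // when the next page starts a new one, or at the end of the file.
            if (pagesToSplit.Contains(page + 1) || isLastPage)
            {
                ranges.Add((first, page));
                first = page + 1;
            }
        }

        return ranges;
    }
}
```

For a 10-page file with split pages 2, 5 and 8, this gives the ranges 0-1, 2-4, 5-7 and 8-9, so four output files.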

BTW: usually the application needs to handle every image, PDF, or any other file that can be added to a VintaSoft ImageCollection. Those can have various dpi settings. For this reason SetRenderingSettings needs to be called to get the same dpi for each document; otherwise predefined recognition regions would not fit any more (at 96 dpi a 100 px long line is physically longer than at 300 dpi).
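
The dpi dependence can be illustrated with a small conversion helper. This is a minimal sketch with hypothetical names (`DpiScaling`, `ScalePixels`), not part of the VintaSoft API; it only shows the arithmetic for carrying a pixel length defined at one resolution over to another.

```csharp
using System;

// Hypothetical helper, not part of any imaging SDK.
public static class DpiScaling
{
    // Converts a length in pixels at fromDpi to the number of pixels that
    // covers the same physical length at toDpi.
    public static int ScalePixels(int pixels, double fromDpi, double toDpi)
    {
        if (fromDpi <= 0 || toDpi <= 0)
            throw new ArgumentOutOfRangeException("Resolution must be positive.");

        // physical length in inches = pixels / fromDpi
        double inches = pixels / fromDpi;
        return (int)Math.Round(inches * toDpi, MidpointRounding.AwayFromZero);
    }
}
```

For example, `DpiScaling.ScalePixels(96, 96, 300)` returns 300: a line of 96 px at 96 dpi is one inch long, and one inch at 300 dpi is 300 px. Each coordinate and size of a recognition region would have to be scaled the same way.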

Re: Saving Multipage Tiff and OCR

Posted: Thu Apr 25, 2013 2:39 pm
by SebastianB
UPDATE:

I think I found the reason: the original document is uncompressed, while the documents after splitting are CCITT4 compressed.

In the documentation of the ImageCollection.SaveSync(string, bool) method I found the following remark:
"Suitable encoder is selected automatically from the extension of the filename, exception is thrown if encoder is not found for file extension specified in filename."

Questions:
1. Does this method choose a compression format for the new files, and if yes, how does it decide?
2. Why does the compression influence the OCR results?
3. How can I get detailed information from the source image, including the compression?

Re: Saving Multipage Tiff and OCR

Posted: Fri Apr 26, 2013 10:03 am
by Yuri
Hello Sebastian,

1. Automatic compression is chosen according to three general rules: most compact, highest performance, maximally lossless.
For TIFF the rules are as follows:
- for black-and-white images, CCITT4 compression is chosen in the majority of cases.
- for 8/16/24 bpp images, LZW is chosen in the majority of cases.

2. Do you have black-and-white input images?

3. To get the detailed information you should use a lower-level class; for TIFFs it is the TiffFile class. Refer to the tiffFile.Pages.Compression property.
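
As a rough illustration of the rules in point 1, the usual outcome can be written as a lookup on bits per pixel. This is only a sketch: the enum and method names are hypothetical, it assumes black-and-white means 1 bpp, and it does not reproduce the encoder's actual decision logic, which only follows these rules "in the majority of cases".

```csharp
// Hypothetical types for illustration only, not the SDK's own enums.
public enum LikelyTiffCompression
{
    Ccitt4,
    Lzw,
    Unknown
}

public static class AutoCompressionSketch
{
    // Maps bits per pixel to the compression that is chosen in most cases,
    // according to the rules listed in point 1.
    public static LikelyTiffCompression Guess(int bitsPerPixel)
    {
        switch (bitsPerPixel)
        {
            case 1:   // black-and-white image (assumption: 1 bpp)
                return LikelyTiffCompression.Ccitt4;
            case 8:
            case 16:
            case 24:
                return LikelyTiffCompression.Lzw;
            default:  // not covered by the rules stated above
                return LikelyTiffCompression.Unknown;
        }
    }
}
```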

We could get a clearer view of the situation if you sent us the source image and the resulting image. Please consider doing that with non-sensitive images and send them to support@vintasoft.com.

Kind regards,
Yuri