PDF Document TextRegion has strange structure

Post by **SebastianB** » Thu Sep 27, 2012 12:58 pm

Hi,

I have a problem with a PDF document. The document contains some text elements that are formatted with a special font (e.g. Tahoma).
I am iterating over the pages and text lines, checking whether each symbol is formatted with this font and appending the matching characters to a StringBuilder. The idea is to extract information from the document for further processing.

Here is what I see in any kind of PDF Viewer (this also includes the VintaSoft Demo Applications):
&Field1:608121 &Field64:01.07.2010 &Field3:12.286,75

I am using the following code to extract the required stuff:

var pdf = new PdfDocument(file);
var sb = new StringBuilder();
for (int iPage = 0; iPage < pdf.Pages.Count; iPage++)
{
    var page = pdf.Pages[iPage];
    foreach (var textRegionLine in page.TextRegion.Lines)
    {
        foreach (var symbol in textRegionLine.Symbols)
        {
            // Compare the symbol's font with the allowed ones and append the symbol to the StringBuilder
        }
    }
}
// Clear the cache before disposing; the document should not be used after Dispose()
pdf.ClearCache();
pdf.Dispose();

And here is what I get (just a part of the output):
&Field1:608121 &Field64:01.07.2010 &Field3:12 286 75
.
,

Any suggestions? Any ideas?
Thanks,
Sebastian

Post by **Alex** » Fri Sep 28, 2012 8:23 am

Hello Sebastian,

Could you send us a demo project that demonstrates the issue? If so, please send the project with a description of the problem to support@vintasoft.com

Best regards, Alexander

VintaSoft Tech Community
