PDF Document TextRegion has strange structure

Questions, comments and suggestions concerning VintaSoft PDF .NET Plug-in.

Moderator: Alex

Posts: 20
Joined: Thu Sep 27, 2012 12:39 pm

Post by SebastianB »


I have a problem with a PDF document. The document contains some text elements that are formatted with a specific font (e.g. Tahoma).
I am iterating over the pages and text lines, checking whether each symbol is formatted with this font, and appending the matching characters to a StringBuilder. The idea is to extract information from the document for further processing.

Here is what I see in any kind of PDF Viewer (this also includes the VintaSoft Demo Applications):
&Field1:608121 &Field64:01.07.2010 &Field3:12.286,75

I am using the following code to extract the required stuff:

var pdf = new PdfDocument(file);
var sb = new StringBuilder();
for (int iPage = 0; iPage < pdf.Pages.Count; iPage++)
{
    var page = pdf.Pages[iPage];
    foreach (var textRegionLine in page.TextRegion.Lines)
    {
        foreach (var symbol in textRegionLine.Symbols)
        {
            // Compare fonts with allowed ones and add the symbol to the StringBuilder
        }
    }
}

And here is what I get (just a part of the output):
&Field1:608121 &Field64:01.07.2010 &Field3:12 286 75
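
To narrow this down, here is a minimal sketch that uses only the standard .NET library. It lists the Unicode code point of each extracted character, which would show whether the gaps where "." and "," should appear are plain spaces (U+0020) or some other character. The names SymbolDump and Describe are hypothetical helpers for illustration and are not part of the VintaSoft API.

```csharp
using System;
using System.Text;

static class SymbolDump
{
    // Hypothetical helper: formats every character of the extracted text
    // as one line of the form "U+XXXX 'c'".
    public static string Describe(string extracted)
    {
        var sb = new StringBuilder();
        foreach (char c in extracted)
        {
            sb.AppendLine($"U+{(int)c:X4} '{c}'");
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Sample fragment copied from the extracted output above.
        Console.Write(Describe("12 286 75"));
    }
}
```

Running the same helper on the actual StringBuilder contents, rather than on the pasted sample, should give a more reliable picture, since copying text into a forum post can normalize unusual whitespace.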

Any suggestions? Any ideas?

Site Admin
Posts: 1802
Joined: Thu Jul 10, 2008 2:21 pm

Re: PDF Document TextRegion has strange structure

Post by Alex »

Hello Sebastian,

Could you send us a demo project that demonstrates the issue? If so, please send the project with a description of the problem to support@vintasoft.com

Best regards, Alexander