VintaSoft Imaging .NET SDK v8.6
    PDF: Search and extract text in PDF document, highlight text of PDF document

    The PdfTextRegion class is intended for searching and extracting text from a whole PDF page or from a region of a PDF page. A text region that represents a whole PDF page can be obtained using the PdfPage.TextRegion property. A text region that represents part of a PDF page can be obtained using the PdfTextRegion.GetSubregion method.


    IMPORTANT! All coordinates that define a text location on a PDF page are specified in the coordinate system of the PDF page. All sizes of text regions are specified in the measurement units of the PDF page. The coordinate system and measurement units of a PDF page are described in a separate topic of this documentation.


    Text search

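    The PdfTextRegion class allows you to search for text on a PDF page using a plain string, a regular expression, or a user-defined search algorithm, as the examples in this section demonstrate.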

    Here is an example that demonstrates how to find text on a PDF page:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging.Pdf
    
    Public Shared Function FindTextOnPdfPage(document As Vintasoft.Imaging.Pdf.PdfDocument, pageIndex As Integer, text As String) As Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion
        ' create a search engine that performs a case-insensitive search
        Dim searchEngine As Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine = Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine.Create(text, True)
        ' find the text using the search engine, starting from the beginning of the page text
        Dim startIndex As Integer = 0
        Return document.Pages(pageIndex).TextRegion.FindText(searchEngine, startIndex, False)
    End Function
                  
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging.Pdf
    
    public static Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion FindTextOnPdfPage(
        Vintasoft.Imaging.Pdf.PdfDocument document, int pageIndex, string text)
    {
        // create a search engine that performs a case-insensitive search
        Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine searchEngine = 
            Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine.Create(text, true);
        // find the text using the search engine, starting from the beginning of the page text
        int startIndex = 0;
        return document.Pages[pageIndex].TextRegion.FindText(searchEngine, ref startIndex, false);
    }
                    
    


    Here is an example that demonstrates how to search for text in a PDF document using a regular expression:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging.Pdf
    
    ''' <summary>
    ''' Outputs the information about digits in content of PDF document.
    ''' </summary>
    ''' <param name="document">PDF document where digits should be searched.</param>
    Public Sub SearchDigitsInTextOfPdfDocument(document As Vintasoft.Imaging.Pdf.PdfDocument)
        System.Console.WriteLine("Searching the digits in text of PDF document is started.")
    
        For i As Integer = 0 To document.Pages.Count - 1
            Dim textRegions As Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion() = SimpleDigitsSearchOnPdfPage(document.Pages(i), New System.Text.RegularExpressions.Regex("\d+"))
            If textRegions IsNot Nothing Then
                For j As Integer = 0 To textRegions.Length - 1
                    System.Console.WriteLine(String.Format("- Text={0}, Rectangle={1}", textRegions(j).TextContent, textRegions(j).Rectangle))
                Next
            End If
        Next
    
        System.Console.WriteLine("Searching the digits in text of PDF document is finished.")
    End Sub
    
    ''' <summary>
    ''' Searches a text, defined with regular expression, on PDF page.
    ''' </summary>
    ''' <param name="page">PDF page where text should be searched.</param>
    ''' <param name="regex">Regular expression which defines the searching text.</param>
    ''' <returns>An array of text regions on PDF page where text was found.</returns>
    Public Function SimpleDigitsSearchOnPdfPage(page As Vintasoft.Imaging.Pdf.Tree.PdfPage, regex As System.Text.RegularExpressions.Regex) As Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion()
        Dim textRegions As New System.Collections.Generic.List(Of Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion)()
        Dim textSearchEngine As Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine = Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine.Create(regex)
    
        Dim textRegion As Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion = Nothing
        Dim startIndex As Integer = 0
        Do
            ' search text
            textRegion = page.TextRegion.FindText(textSearchEngine, startIndex, False)
            ' if found text is not empty
            If textRegion IsNot Nothing Then
                ' add result
                textRegions.Add(textRegion)
                ' shift start index past the found text
                startIndex += textRegion.TextContent.Length
    
            End If
        Loop While textRegion IsNot Nothing
    
        Return textRegions.ToArray()
    End Function
                  
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging.Pdf
    
    /// <summary>
    /// Outputs the information about digits in content of PDF document.
    /// </summary>
    /// <param name="document">PDF document where digits should be searched.</param>
    public void SearchDigitsInTextOfPdfDocument(Vintasoft.Imaging.Pdf.PdfDocument document)
    {
        System.Console.WriteLine("Searching the digits in text of PDF document is started.");
    
        for (int i = 0; i < document.Pages.Count; i++)
        {
            Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion[] textRegions = 
                SimpleDigitsSearchOnPdfPage(document.Pages[i], new System.Text.RegularExpressions.Regex(@"\d+"));
            if (textRegions != null)
            {
                for (int j = 0; j < textRegions.Length; j++)
                {
                    System.Console.WriteLine(string.Format("- Text={0}, Rectangle={1}",
                        textRegions[j].TextContent,
                        textRegions[j].Rectangle));
                }
            }
        }
    
        System.Console.WriteLine("Searching the digits in text of PDF document is finished.");
    }
    
    /// <summary>
    /// Searches a text, defined with regular expression, on PDF page.
    /// </summary>
    /// <param name="page">PDF page where text should be searched.</param>
    /// <param name="regex">Regular expression which defines the searching text.</param>
    /// <returns>An array of text regions on PDF page where text was found.</returns>
    public Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion[] SimpleDigitsSearchOnPdfPage(
        Vintasoft.Imaging.Pdf.Tree.PdfPage page, 
        System.Text.RegularExpressions.Regex regex)
    {
        System.Collections.Generic.List<Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion> textRegions = 
            new System.Collections.Generic.List<Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion>();
        Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine textSearchEngine = 
            Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine.Create(regex);
    
        Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion textRegion = null;
        int startIndex = 0;
        do
        {
            // search text
            textRegion = page.TextRegion.FindText(textSearchEngine, ref startIndex, false);
            // if found text is not empty
            if (textRegion != null)
            {
                // add result
                textRegions.Add(textRegion);
                // shift start index past the found text
                startIndex += textRegion.TextContent.Length;
            }
    
        } while (textRegion != null);
    
        return textRegions.ToArray();
    }
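
The search loop above repeatedly finds a match and then moves the start index past it. The following sketch shows the same find-and-advance pattern on a plain string, using only System.Text.RegularExpressions, so it can be run without the SDK. The class and method names (RegexAdvanceSketch, FindAll) are hypothetical and are not part of the VintaSoft API. The sketch also assumes, as the ref parameter in the C# sample suggests, that the search reports the position of the match, so that adding the match length moves the index just past it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the VintaSoft SDK.
public static class RegexAdvanceSketch
{
    // Collects every match of 'regex' in 'text' by searching from 'startIndex'
    // and then moving 'startIndex' past the match that was just found.
    public static List<string> FindAll(string text, Regex regex)
    {
        List<string> results = new List<string>();
        int startIndex = 0;
        while (startIndex <= text.Length)
        {
            // search for the next match, beginning at startIndex
            Match match = regex.Match(text, startIndex);
            if (!match.Success)
                break;

            results.Add(match.Value);

            // continue after the match; step at least one character
            // so that an empty match cannot stall the loop
            startIndex = match.Index + Math.Max(match.Length, 1);
        }
        return results;
    }

    public static void Main()
    {
        List<string> digits = FindAll("Page 12 of 340, rev. 7", new Regex(@"\d+"));
        Console.WriteLine(string.Join(", ", digits)); // prints "12, 340, 7"
    }
}
```

Stepping by at least one character is a defensive choice for patterns that can match an empty string; with a pattern such as \d+ every match is non-empty, so the step equals the match length.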
                    
    


    Here is an example that demonstrates how to search for text in a PDF document using a user-defined text search algorithm:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging.Pdf
    
    ''' <summary>
    ''' Outputs the information about digits in content of PDF document.
    ''' </summary>
    ''' <param name="document">PDF document where digits should be searched.</param>
    Public Sub SearchDigitsInTextOfPdfDocumentUsingTextSearchEngine(document As Vintasoft.Imaging.Pdf.PdfDocument)
        System.Console.WriteLine("Searching the digits in text of PDF document.")
    
        For i As Integer = 0 To document.Pages.Count - 1
            Dim textRegions As Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion() = AdvancedDigitsSearchOnPdfPage(document.Pages(i))
            If textRegions IsNot Nothing Then
                For j As Integer = 0 To textRegions.Length - 1
                    System.Console.WriteLine(String.Format("- Text={0}, Rectangle={1}", textRegions(j).TextContent, textRegions(j).Rectangle))
                Next
            End If
        Next
    
        System.Console.WriteLine("Searching the digits in text of PDF document is finished.")
    End Sub
    
    ''' <summary>
    ''' Searches digits on PDF page.
    ''' </summary>
    ''' <param name="page">PDF page where digits should be searched.</param>
    ''' <returns>An array of text regions on PDF page where text was found.</returns>
    Public Function AdvancedDigitsSearchOnPdfPage(page As Vintasoft.Imaging.Pdf.Tree.PdfPage) As Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion()
        Dim textRegions As New System.Collections.Generic.List(Of Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion)()
        Dim digitsSearchEngine As New DigitsSearchEngine()
    
        Dim textRegion As Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion = Nothing
        Dim startIndex As Integer = 0
        Do
            ' search text
            textRegion = page.TextRegion.FindText(digitsSearchEngine, startIndex, False)
            If textRegion IsNot Nothing Then
                ' add result
                textRegions.Add(textRegion)
                ' shift start index past the found text
                startIndex += textRegion.TextContent.Length
    
            End If
        Loop While textRegion IsNot Nothing
    
        Return textRegions.ToArray()
    End Function
    
    ''' <summary>
    ''' Class for searching the digits in text of PDF page.
    ''' </summary>
    Private Class DigitsSearchEngine
        Inherits Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine
    
        ''' <summary>
        ''' Searches the first text matching in the string of PDF page.
        ''' </summary>
        ''' <param name="sourceString">Source string (string of PDF page) where text must be searched.</param>
        ''' <param name="startIndex">The zero-based index, in the sourceString, from which text must be searched.</param>
        ''' <param name="length">The number of characters, in the sourceString, to analyze.</param>
        ''' <param name="rightToLeft">Indicates that text should be searched from right to left.</param>
        ''' <returns>
        ''' Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchResult object that
        ''' contains information about searched text if text is found; otherwise, null.
        ''' </returns>
        Public Overrides Function Find(sourceString As String, startIndex As Integer, length As Integer, rightToLeft As Boolean) As Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchResult
            Dim startDigitIndex As Integer = -1
            Dim endDigitIndex As Integer = -1
            Dim start As Integer = 0
            Dim [end] As Integer = 0
    
            ' if searching text from the right to the left
            If rightToLeft Then
                start = startIndex + length
                [end] = 0
                For index As Integer = start - 1 To [end] Step -1
                    If Char.IsDigit(sourceString(index)) AndAlso endDigitIndex = -1 Then
                        endDigitIndex = index + 1
                    ElseIf Not Char.IsDigit(sourceString(index)) AndAlso endDigitIndex <> -1 Then
                        startDigitIndex = index + 1
                        Exit For
                    End If
                Next
                If endDigitIndex <> -1 AndAlso startDigitIndex = -1 Then
                    startDigitIndex = 0
                End If
            Else
                ' if searching text from the left to the right
                start = startIndex
                [end] = startIndex + length
                For index As Integer = start To [end] - 1
                    If Char.IsDigit(sourceString(index)) AndAlso startDigitIndex = -1 Then
                        startDigitIndex = index
                    ElseIf Not Char.IsDigit(sourceString(index)) AndAlso startDigitIndex <> -1 Then
                        endDigitIndex = index
                        Exit For
                    End If
                Next
                If startDigitIndex <> -1 AndAlso endDigitIndex = -1 Then
                    endDigitIndex = [end]
                End If
            End If
    
            ' if digit is not found
            If startDigitIndex = -1 Then
                Return Nothing
            End If
    
            ' return the text search result
            Return New Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchResult(startDigitIndex, endDigitIndex - startDigitIndex)
        End Function
    End Class
                  
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging.Pdf
    
    /// <summary>
    /// Outputs the information about digits in content of PDF document.
    /// </summary>
    /// <param name="document">PDF document where digits should be searched.</param>
    public void SearchDigitsInTextOfPdfDocumentUsingTextSearchEngine(Vintasoft.Imaging.Pdf.PdfDocument document)
    {
        System.Console.WriteLine("Searching the digits in text of PDF document.");
    
        for (int i = 0; i < document.Pages.Count; i++)
        {
            Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion[] textRegions = 
                AdvancedDigitsSearchOnPdfPage(document.Pages[i]);
            if (textRegions != null)
            {
                for (int j = 0; j < textRegions.Length; j++)
                {
                    System.Console.WriteLine(string.Format("- Text={0}, Rectangle={1}",
                        textRegions[j].TextContent,
                        textRegions[j].Rectangle));
                }
            }
        }
    
        System.Console.WriteLine("Searching the digits in text of PDF document is finished.");
    }
    
    /// <summary>
    /// Searches digits on PDF page.
    /// </summary>
    /// <param name="page">PDF page where digits should be searched.</param>
    /// <returns>An array of text regions on PDF page where text was found.</returns>
    public Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion[] AdvancedDigitsSearchOnPdfPage(
        Vintasoft.Imaging.Pdf.Tree.PdfPage page)
    {
        System.Collections.Generic.List<Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion> textRegions = 
            new System.Collections.Generic.List<Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion>();
        DigitsSearchEngine digitsSearchEngine = new DigitsSearchEngine();
    
        Vintasoft.Imaging.Pdf.Content.TextExtraction.PdfTextRegion textRegion = null;
        int startIndex = 0;
        do
        {
            // search text
            textRegion = page.TextRegion.FindText(digitsSearchEngine, ref startIndex, false);
            if (textRegion != null)
            {
                // add result
                textRegions.Add(textRegion);
                // shift start index past the found text
                startIndex += textRegion.TextContent.Length;
            }
    
        } while (textRegion != null);
    
        return textRegions.ToArray();
    }
    
    /// <summary>
    /// Class for searching the digits in text of PDF page.
    /// </summary>
    class DigitsSearchEngine : Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchEngine
    {
    
        /// <summary>
        /// Searches the first text matching in the string of PDF page.
        /// </summary>
        /// <param name="sourceString">Source string (string of PDF page) where text must be searched.</param>
        /// <param name="startIndex">The zero-based index, in the sourceString, from which text must be searched.</param>
        /// <param name="length">The number of characters, in the sourceString, to analyze.</param>
        /// <param name="rightToLeft">Indicates that text should be searched from right to left.</param>
        /// <returns>
        /// Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchResult object that
        /// contains information about searched text if text is found; otherwise, null.
        /// </returns>
        public override Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchResult Find(
            string sourceString, int startIndex, int length, bool rightToLeft)
        {
            int startDigitIndex = -1;
            int endDigitIndex = -1;
            int start = 0;
            int end = 0;
    
            // if searching text from the right to the left
            if (rightToLeft)
            {
                start = startIndex + length;
                end = 0;
                for (int index = start - 1; index >= end; index--)
                {
                    if (char.IsDigit(sourceString[index]) && endDigitIndex == -1)
                        endDigitIndex = index + 1;
                    else if (!char.IsDigit(sourceString[index]) && endDigitIndex != -1)
                    {
                        startDigitIndex = index + 1;
                        break;
                    }
                }
                if (endDigitIndex != -1 && startDigitIndex == -1)
                    startDigitIndex = 0;
            }
            // if searching text from the left to the right
            else
            {
                start = startIndex;
                end = startIndex + length;
                for (int index = start; index < end; index++)
                {
                    if (char.IsDigit(sourceString[index]) && startDigitIndex == -1)
                        startDigitIndex = index;
                    else if (!char.IsDigit(sourceString[index]) && startDigitIndex != -1)
                    {
                        endDigitIndex = index;
                        break;
                    }
                }
                if (startDigitIndex != -1 && endDigitIndex == -1)
                    endDigitIndex = end;
            }
    
            // if digit is not found
            if (startDigitIndex == -1)
                return null;
    
            // return the text search result
            return new Vintasoft.Imaging.Pdf.Content.TextExtraction.TextSearchResult(
                startDigitIndex, endDigitIndex - startDigitIndex);
        }
    }
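
To see what a custom Find method of this kind returns, the digit-run scan can be tried on a plain string without the SDK. The sketch below is a simplified, stand-alone variant of the logic above, not the SDK class itself: DigitRunSketch and FindDigitRun are hypothetical names, the result is a plain (start, length) tuple instead of TextSearchResult, and, unlike the sample, the backward scan stops at startIndex rather than at index 0, which is an assumption about how the search window should be treated.

```csharp
using System;

// Hypothetical stand-alone helper, not part of the VintaSoft SDK.
public static class DigitRunSketch
{
    // Returns the start index and length of a run of digits inside the window
    // source[startIndex .. startIndex + length), or null if the window has no digits.
    // Left to right, the first run is returned; right to left, the last run is returned.
    public static (int Start, int Length)? FindDigitRun(
        string source, int startIndex, int length, bool rightToLeft)
    {
        int windowEnd = startIndex + length;
        int runStart = -1;
        int runEnd = -1;

        if (rightToLeft)
        {
            // walk backwards: the first digit seen marks the end of the run
            for (int i = windowEnd - 1; i >= startIndex; i--)
            {
                if (char.IsDigit(source[i]))
                {
                    if (runEnd == -1)
                        runEnd = i + 1;
                    runStart = i;
                }
                else if (runEnd != -1)
                    break;
            }
        }
        else
        {
            // walk forwards: the first digit seen marks the start of the run
            for (int i = startIndex; i < windowEnd; i++)
            {
                if (char.IsDigit(source[i]))
                {
                    if (runStart == -1)
                        runStart = i;
                    runEnd = i + 1;
                }
                else if (runStart != -1)
                    break;
            }
        }

        if (runStart == -1)
            return null;
        return (runStart, runEnd - runStart);
    }

    public static void Main()
    {
        string text = "Invoice 2024, total 315 USD";
        var first = FindDigitRun(text, 0, text.Length, false);
        var last = FindDigitRun(text, 0, text.Length, true);
        Console.WriteLine(text.Substring(first.Value.Start, first.Value.Length)); // prints "2024"
        Console.WriteLine(text.Substring(last.Value.Start, last.Value.Length));  // prints "315"
    }
}
```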
                    
    



    Text extraction

    The PdfTextRegion class allows you to extract text from the whole PDF page or from a region of the PDF page.

    When extracting text from a page region, it is necessary to specify how the text must be extracted. The SDK supports several extraction modes; by default, the text is extracted by full lines.

    Here is an example that demonstrates how to extract all text from the whole PDF page:
    ' The project, which uses this code, must have references to the following assemblies:
    ' - Vintasoft.Imaging.Pdf
    
    Public Shared Function ExtractTextFromPdfPage(document As Vintasoft.Imaging.Pdf.PdfDocument, pageIndex As Integer) As String
        Return document.Pages(pageIndex).TextRegion.TextContent
    End Function
                  
    
    // The project, which uses this code, must have references to the following assemblies:
    // - Vintasoft.Imaging.Pdf
    
    public static string ExtractTextFromPdfPage(Vintasoft.Imaging.Pdf.PdfDocument document, int pageIndex)
    {
        return document.Pages[pageIndex].TextRegion.TextContent;
    }
                    
    


    The PdfTextRegion class also allows you to extract text from a PDF page as a tree structure: first obtain the region that represents all text of the page (PdfPage.TextRegion), then all text lines of that region (PdfTextRegion.Lines), then all symbols of each text line (PdfTextRegionLine.Symbols).