Preventing the text extraction from PDF document by obfuscating text encoding and fonts of the document.


Post by Alex » Sun Oct 09, 2016 10:22 am

Starting with version 8.4, VintaSoft Imaging .NET SDK can obfuscate the text encoding for all font types.

Why is this useful? Suppose you publish a document openly (or hand it to a third party) and do not want its text to be easily extracted by copy and paste in a PDF viewer.

Solution 1: Forbid text extraction using the flags in the document security settings. However, whether these settings are honored depends on the application the user views the document with. The text can still be extracted easily, just not in every PDF viewer.

Solution 2: Completely remove the information about text encoding, shuffle the glyphs of the font, and shuffle the symbol codes on the page. This can be done with the text encoding obfuscation functionality of VintaSoft PDF .NET Plug-in. The obfuscation mechanism completely removes the text encoding information, shuffles the symbol glyphs in the font and on the page, and duplicates fonts and symbol glyphs. Text obfuscated this way can be extracted only with OCR.
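
To make the idea concrete, here is a minimal Python sketch of encoding obfuscation as a toy model. It is not the VintaSoft API: the functions build_obfuscated_font and naive_extract, and the representation of a glyph as a plain character, are assumptions made purely for illustration. The page keeps rendering correctly because the viewer draws glyphs by code, while an extractor that has no encoding information sees only meaningless codes.

```python
import random
import string


def build_obfuscated_font(alphabet, rng):
    """Assign every character a random glyph code (hypothetical toy model)."""
    codes = list(range(len(alphabet)))
    rng.shuffle(codes)
    char_to_code = dict(zip(alphabet, codes))
    # The viewer draws by code: code -> glyph shape.
    # In this toy model the "shape" is simply the character itself.
    code_to_glyph = {code: ch for ch, code in char_to_code.items()}
    return char_to_code, code_to_glyph


def naive_extract(codes):
    """Without encoding information an extractor can only guess;
    here it maps each code onto some printable character."""
    return "".join(chr(0x21 + code) for code in codes)


rng = random.Random(2016)
alphabet = string.ascii_letters + " ."
char_to_code, code_to_glyph = build_obfuscated_font(alphabet, rng)

# Codes stored in the page content after obfuscation.
page_codes = [char_to_code[ch] for ch in "Text obfuscation test."]

rendered = "".join(code_to_glyph[code] for code in page_codes)
extracted = naive_extract(page_codes)

print("Rendered: ", rendered)   # what the reader sees on the page
print("Extracted:", extracted)  # what copy/paste would return
```

In this model the rendered line is the original sentence, while the extracted line is unrelated characters of the same length.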

Let us review the obfuscation of a simple text document (obfuscateTest_source.pdf) using the PdfEditorDemo application.

The document contains two pages. Four sentences are written using three different fonts.
The first sentence on the first page and the first sentence on the second page use the same font.

Image

The text extraction panel and the font viewing dialog show that all symbols have the correct encoding.

Image

The obfuscation dialog is opened from the application menu: Text -> Obfuscate Text Encoding -> Settings:

Image

Let's perform the obfuscation with the default settings and review the output document (obfuscateTest_noDuplicate_1_1.pdf).

A review of the document shows that the fonts are completely changed and the glyphs have been relocated randomly. The text extraction panel now displays the text that is actually extracted.

Image

Because the first sentence on both pages is written with the same font, the symbols of the first word (highlighted in green) are the same on both pages:

Image
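
Identical codes on both pages matter because they allow a correspondence table to be built. The following minimal Python sketch shows this under stated assumptions: one shared font is modeled as a random character-to-code mapping, the attacker reads page 1 by eye (or with OCR), and all names (obfuscate, shared_font, table) are hypothetical and unrelated to the VintaSoft API.

```python
import random


def obfuscate(text, char_to_code):
    """Replace every character with its code in the obfuscated font."""
    return [char_to_code[ch] for ch in text]


rng = random.Random(7)
alphabet = sorted(set("Text on page one. Text on page two."))
codes = list(range(len(alphabet)))
rng.shuffle(codes)
shared_font = dict(zip(alphabet, codes))  # one font used on both pages

page1 = obfuscate("Text on page one.", shared_font)
page2 = obfuscate("Text on page two.", shared_font)

# Same font and same word give the same codes on both pages.
same_first_word = page1[:4] == page2[:4]

# The attacker pairs the codes of page 1 with the text read from the screen.
known_plaintext = "Text on page one."
table = dict(zip(page1, known_plaintext))

# The table decodes page 2 wherever the codes were already seen.
decoded = "".join(table.get(code, "?") for code in page2)
print(same_first_word)  # True
print(decoded)          # Text on page t?o.
```

Only the letter "w", which never occurs on page 1, stays unknown. The fewer fonts a document uses, the faster such a table becomes complete.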

You might say: "I will write a simple program and extract the text. I will manually compose a correspondence table and decode the text." Yes, this is possible, especially when the document uses only a few fonts. That's why we have implemented obfuscation settings that make the decoding task almost impossible:

Let's switch on the font duplication option: each page will use its own set of fonts, and the glyphs of each font set will be relocated randomly.
Let's switch on the glyph duplication option: identical symbols on a page will use different glyphs in the font, and the glyph code will be chosen randomly. The output document: obfuscateTest_Duplicate_2_3.pdf.

Image

Now we can see that different glyphs are used for the letter "e", so the extracted text contains this letter under different codes.

Image

The word "Text" from the first line is now written with a different font on each page. Each font uses its own random arrangement and duplication of glyphs. As a result, extracting this word from the first and second pages gives completely different results:

Image
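
The effect of the two duplication options can be sketched in Python as well. This is an illustrative model only, not the plug-in's implementation: make_page_font, encode, and render are hypothetical helpers, and the choice of three duplicate glyphs per character is an assumption for the example.

```python
import random


def make_page_font(alphabet, copies, rng):
    """Build a font for one page only: every character gets `copies`
    distinct, randomly placed glyph codes (hypothetical toy model)."""
    codes = list(range(len(alphabet) * copies))
    rng.shuffle(codes)
    return {ch: codes[i * copies:(i + 1) * copies]
            for i, ch in enumerate(alphabet)}


def encode(text, font, rng):
    """Each occurrence of a character picks one of its duplicate glyphs."""
    return [rng.choice(font[ch]) for ch in text]


def render(codes, font):
    """The viewer still draws the right shapes: code -> character."""
    code_to_char = {code: ch for ch, dupes in font.items() for code in dupes}
    return "".join(code_to_char[code] for code in codes)


rng = random.Random(84)
text = "Text extraction"
alphabet = sorted(set(text))

# Font duplication: each page gets its own independently shuffled font.
font_page1 = make_page_font(alphabet, 3, rng)
font_page2 = make_page_font(alphabet, 3, rng)

codes1 = encode(text, font_page1, rng)
codes2 = encode(text, font_page2, rng)

print(codes1)  # codes stored on page 1
print(codes2)  # codes stored on page 2
print(render(codes1, font_page1) == render(codes2, font_page2))
```

Both pages render the same words, but a correspondence table built for one page is useless on the other, and even within a single page one letter may appear under several codes.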
