Prevent text extraction from a PDF document by obfuscating the text

VintaSoft Imaging .NET SDK can obfuscate the text encoding for all font types.

What is the purpose? For example, you publish a document with open access (or give it to a third party) and you do not want the text of the document to be easily extracted by copy/paste in a PDF viewer.

Solution 1: Forbid text extraction using the flags in the PDF document security settings. However, whether these settings are honored depends on the application the user views the document with. If the PDF viewer ignores the PDF document security settings, the text can be extracted easily.
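
The weakness of Solution 1 can be illustrated with a minimal Python sketch. It assumes the standard permission bits of the PDF encryption dictionary (bit 5 controls copying and extracting text) and treats the permission value as a plain integer, which is a simplification. The two viewer functions are hypothetical and only show that enforcement is the viewer's choice.

```python
# Permission bits from the /P entry of a PDF encryption dictionary.
# Bit positions are 1-based in the PDF specification.
PRINT = 1 << 2            # bit 3: print the document
MODIFY = 1 << 3           # bit 4: modify the contents
COPY_OR_EXTRACT = 1 << 4  # bit 5: copy or extract text and graphics


def compliant_viewer_allows_copy(permissions: int) -> bool:
    """A viewer that honors the flags checks bit 5 before copying text."""
    return bool(permissions & COPY_OR_EXTRACT)


def non_compliant_viewer_allows_copy(permissions: int) -> bool:
    """A viewer that ignores the flags always lets the user copy text."""
    return True


# Hypothetical document: printing is allowed, copying is forbidden.
permissions = PRINT
print(compliant_viewer_allows_copy(permissions))      # False
print(non_compliant_viewer_allows_copy(permissions))  # True
```

The flag only records the author's intent; the text itself stays fully decodable inside the file.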

Solution 2: Completely remove the information about the text encoding, shuffle the glyphs in the font, and shuffle the symbol codes on the page. This can be done using the text encoding obfuscation functionality provided by VintaSoft PDF .NET Plug-in. The obfuscation mechanism completely removes the information about the text encoding, shuffles the symbol glyphs in the font and on the page, and duplicates fonts and symbol glyphs. Text obfuscated in this way can be extracted only by OCR.
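
The idea behind Solution 2 can be sketched with a toy model in Python. This is not the VintaSoft implementation: the functions, the use of a character as a stand-in for a glyph outline, and the private-use characters returned by `extract` are all assumptions made for illustration. The sketch shows that the page still renders correctly while the stored codes no longer say which characters they represent.

```python
import random


def obfuscate(page_text: str, seed: int = 0):
    """Toy model: assign every character a random code and keep no
    code-to-character (encoding) table."""
    rng = random.Random(seed)
    chars = sorted(set(page_text))
    codes = list(range(len(chars)))
    rng.shuffle(codes)
    char_to_code = dict(zip(chars, codes))
    # The font keeps only code -> glyph shape. Here the "shape" is the
    # character itself, standing in for the outline a viewer would draw.
    font_glyphs = {code: ch for ch, code in char_to_code.items()}
    page_codes = [char_to_code[ch] for ch in page_text]
    return font_glyphs, page_codes


def render(font_glyphs, page_codes) -> str:
    """What the reader sees: glyph shapes drawn in page order."""
    return "".join(font_glyphs[code] for code in page_codes)


def extract(page_codes) -> str:
    """What copy/paste could yield without an encoding table: the raw codes,
    shown here as private-use characters (an assumption of this sketch)."""
    return "".join(chr(0xE000 + code) for code in page_codes)


font, codes = obfuscate("Text obfuscation")
print(render(font, codes))                  # Text obfuscation
print(extract(codes) == "Text obfuscation")  # False
```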

Let us review the obfuscation of a simple PDF document (document-without-text-obfuscation.pdf) using the VintaSoft PDF Editor Demo application.

The document contains two pages. Four sentences are written using three different fonts.

The first sentence on the first page and the first sentence on the second page are written using the same font.
Text before text encoding obfuscation in PDF document

The text extraction panel and the font viewing dialog show that all symbols have the correct text encoding.
Text characters before text encoding obfuscation in PDF document

The text encoding obfuscation dialog can be opened from the application menu: Text -> Obfuscate Text Encoding -> Settings:
Standard settings for text encoding obfuscation in PDF document

Let's perform the obfuscation with the default settings and review the output document with obfuscated text: document-with-default-text-obfuscation.pdf.

A review of the document shows that the fonts have been completely changed and the glyphs have been relocated randomly. The extracted text can be seen in the text extraction panel.
Text after text encoding obfuscation in PDF document

Because the first sentence on both pages is written in the same font, the extracted symbols of the first word (highlighted in green) are the same on both pages:
The same word is written using one font after text encoding obfuscation in PDF document

You might say: "I will write a simple program and extract the text. I will manually compose a character mapping table and decode the text." Yes, this is possible, especially when the document uses only a few fonts. That is why we have implemented obfuscation settings that make the decoding task almost impossible.
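
The mapping-table attack described above can be sketched in Python as follows. The code values and the two helper functions are hypothetical; the sketch only assumes that, with the default settings, one code in a font always stands for one character, so a table built from one page also decodes another page that shares the font.

```python
def build_mapping(extracted_codes, visible_text):
    """Pair each extracted code with the character the reader sees on screen."""
    mapping = {}
    for code, ch in zip(extracted_codes, visible_text):
        # With simple obfuscation, one code always means one character.
        if mapping.setdefault(code, ch) != ch:
            raise ValueError(f"code {code} maps to more than one character")
    return mapping


def decode(extracted_codes, mapping):
    """Decode codes with the table; codes not in the table become '?'."""
    return "".join(mapping.get(code, "?") for code in extracted_codes)


# Hypothetical codes extracted from two pages that share one obfuscated font.
page1_codes = [7, 2, 9, 4, 0, 4, 2, 9, 4]  # the reader sees "Text text"
page2_codes = [4, 2, 9, 4, 0, 7, 2, 9, 4]  # not yet read by the attacker

table = build_mapping(page1_codes, "Text text")
print(decode(page2_codes, table))  # text Text
```

The fewer fonts a document uses, the less text an attacker has to transcribe by hand before the table covers every code.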

Let's open the text encoding obfuscation dialog again and:

switch on the duplication option for fonts - each page will use its own set of fonts, and the glyphs of each font set will be relocated randomly.
switch on the duplication option for font glyphs - identical symbols on a page will use different glyphs in the font, and the glyph code will be chosen randomly.

Settings for strong text encoding obfuscation in PDF document
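
The effect of the two duplication options can be sketched with a toy model in Python. It is an illustration under stated assumptions, not the VintaSoft implementation: `obfuscate_page`, the `copies` parameter, and the use of a character as a stand-in for a glyph outline are all hypothetical. Each page gets its own font, and every character owns several codes that draw the same shape.

```python
import random


def obfuscate_page(page_text: str, rng: random.Random, copies: int = 3):
    """Build a page-specific font in which every character owns several codes."""
    chars = sorted(set(page_text))
    codes = list(range(len(chars) * copies))
    rng.shuffle(codes)
    # Each character gets `copies` distinct codes that all draw the same shape.
    char_to_codes = {
        ch: codes[i * copies:(i + 1) * copies] for i, ch in enumerate(chars)
    }
    font_glyphs = {c: ch for ch, cs in char_to_codes.items() for c in cs}
    # Every occurrence of a character picks one of its codes at random.
    page_codes = [rng.choice(char_to_codes[ch]) for ch in page_text]
    return font_glyphs, page_codes


def render(font_glyphs, page_codes) -> str:
    """What the reader sees: glyph shapes drawn in page order."""
    return "".join(font_glyphs[code] for code in page_codes)


rng = random.Random()  # unseeded, so the layout differs on every run
font1, codes1 = obfuscate_page("Text on page one", rng)
font2, codes2 = obfuscate_page("Text on page two", rng)
print(render(font1, codes1))  # Text on page one
print(render(font2, codes2))  # Text on page two
# Codes of the word "Text" on each page; these usually differ between pages.
print(codes1[:4], codes2[:4])
```

Because one character no longer corresponds to one code, and a table built for one page does not apply to the next, a manually composed mapping table has to be rebuilt for every page and every duplicate glyph.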

After this, let's obfuscate the text encoding in the PDF document once again and review the output PDF document with obfuscated text: document-with-strong-text-obfuscation.pdf.

Now we can see that different glyphs are used for the letter "e", so the extracted text contains this letter as symbols with different codes.
The same character is written using different glyphs after text encoding obfuscation in PDF document

The word "Text" from the first line is now written using different fonts. Each font has its own random glyph arrangement and glyph duplication. As a result, the extraction results for this word on the first and second pages are completely different:
The same word is written using different fonts after text encoding obfuscation in PDF document