Starting with version 8.4, VintaSoft Imaging .NET SDK can obfuscate the text encoding for all font types.
What is the purpose? For example, you publish a document with open access (or give it to a third party) and you do not want the text of the document to be easily extracted by copy/paste in any PDF viewer.
Solution 1: Forbid text extraction using the flags of the document security settings. However, whether these settings are honored depends on the application the user views the document with. The text can still be extracted easily, just not in every PDF viewer.
Solution 2: Completely remove the information about the text encoding and mix up the font glyphs and the symbol codes on the page. This can be done with the text encoding obfuscation functionality of VintaSoft PDF .NET Plug-in. The obfuscation mechanism completely removes the information about the text encoding, mixes up the symbol glyphs in the font and on the page, and duplicates fonts and symbol glyphs. Text obfuscated in this way can be extracted only with OCR.
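The idea behind Solution 2 can be illustrated with a toy model. The sketch below is not the VintaSoft API; it is a minimal Python illustration that assumes a font is nothing more than a mapping from characters to glyph codes. All names in it (shuffle_font, encode, render, extract_without_encoding) are hypothetical.

```python
import random


def shuffle_font(chars, rng):
    """Assign every character a randomly chosen glyph code (toy 'font')."""
    codes = list(range(len(chars)))
    rng.shuffle(codes)
    return dict(zip(chars, codes))


def encode(text, char_to_code):
    """Codes that would be stored on the page for the given text."""
    return [char_to_code[ch] for ch in text]


def render(codes, char_to_code):
    """What the reader sees: each code still draws its own glyph."""
    code_to_char = {code: ch for ch, code in char_to_code.items()}
    return "".join(code_to_char[c] for c in codes)


def extract_without_encoding(codes):
    """With the encoding information removed, only raw codes are available."""
    return " ".join(f"{c:02X}" for c in codes)


text = "Text sample"
rng = random.Random(2024)
font = shuffle_font(sorted(set(text)), rng)
codes = encode(text, font)

print(render(codes, font))              # the page still looks the same
print(extract_without_encoding(codes))  # copy/paste yields meaningless codes
```

In this model the rendered page is unchanged, while an extractor that has no encoding information can only report codes that carry no meaning by themselves.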
Let us review the obfuscation of a simple text document (obfuscateTest_source.pdf) using the PdfEditorDemo application.
The document contains two pages. Four sentences have been written using three different fonts.
The first sentence on the first page and the first sentence on the second page are written in the same font.
The text extraction panel and font viewing dialog show that all symbols have correct encoding.
The obfuscation dialog is opened from the application menu: Text -> Obfuscate Text Encoding -> Settings.
Let's perform the obfuscation with the default settings and review the output document (obfuscateTest_noDuplicate_1_1.pdf).
A review of the document shows that the fonts are completely changed and the glyphs have been relocated randomly. The text extraction panel now displays the text as it is extracted after obfuscation.
Because the first sentence on both pages is written in the same font, the symbols of the first word (highlighted in green) are the same on both pages.
You might say: "I will write a simple program and extract the text. I will manually compose a correspondence table and decode the text." Yes, this is possible, especially when the document uses only a few fonts. That is why we have implemented obfuscation settings that make the decoding task almost impossible:
Let's switch on the duplication option for fonts: each page will use its own set of fonts, and the glyphs of each font set will be relocated randomly.
Let's switch on the duplication option for font glyphs: identical symbols on a page will use different glyphs in the font, and the glyph code will be chosen randomly. The output document: obfuscateTest_Duplicate_2_3.pdf.
Now we can see that different glyphs are used for the letter "e", so the extracted text contains this letter under several different codes.
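The effect of glyph duplication can be sketched with the same kind of toy model. This is an illustration only, not the plug-in's implementation: it assumes every character simply receives several interchangeable codes, and the names duplicate_glyphs, encode and render, as well as the copies parameter, are hypothetical.

```python
import random


def duplicate_glyphs(chars, copies, rng):
    """Give every character `copies` distinct codes, handed out in random order."""
    codes = list(range(len(chars) * copies))
    rng.shuffle(codes)
    return {ch: codes[i * copies:(i + 1) * copies] for i, ch in enumerate(chars)}


def encode(text, char_to_codes, rng):
    """Each occurrence of a character picks one of its codes at random."""
    return [rng.choice(char_to_codes[ch]) for ch in text]


def render(codes, char_to_codes):
    """Every duplicate code draws the same glyph, so the page looks unchanged."""
    code_to_char = {c: ch for ch, cs in char_to_codes.items() for c in cs}
    return "".join(code_to_char[c] for c in codes)


rng = random.Random(0)
font = duplicate_glyphs("Text", copies=3, rng=rng)
codes = encode("eeeeee", font, rng)

print(font["e"])            # the three codes that all draw the letter "e"
print(codes)                # the same letter stored under varying codes
print(render(codes, font))  # still reads "eeeeee" on the page
```

Because one letter no longer corresponds to one code, counting how often each code occurs tells an attacker much less about the underlying letters.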
The word "Text" from the first line is now written with different fonts on each page. Each font has its own random glyph layout and its own glyph duplicates. As a result, extracting this word from the first and second pages gives completely different results.
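The "correspondence table" attack mentioned earlier, and the reason per-page fonts defeat it, can be sketched as follows. This is a toy model under stated assumptions: the two code tables are hypothetical hand-written examples, and build_table and decode are illustrative helpers, not part of any SDK.

```python
def encode(text, char_to_code):
    """Codes stored on the page for the given text."""
    return [char_to_code[ch] for ch in text]


def build_table(codes, visible_text):
    """Attacker pairs extracted codes with the text read from the screen."""
    return dict(zip(codes, visible_text))


def decode(codes, table):
    """Apply the correspondence table; unknown codes become '?'."""
    return "".join(table.get(c, "?") for c in codes)


# Hypothetical shuffled code tables (character -> glyph code).
shared_font = {"T": 5, "e": 2, "x": 7, "t": 0, " ": 3, "o": 6, "n": 1}
page2_font = {"T": 4, "e": 0, "x": 1, "t": 6, " ": 2, "o": 3, "n": 7}

# The attacker builds the table from page 1, which uses shared_font.
table = build_table(encode("Text one", shared_font), "Text one")

# Page 2 uses the same font: the table decodes it.
same_font_result = decode(encode("next tone", shared_font), table)

# Page 2 uses its own shuffled copy of the font: the table is useless.
own_font_result = decode(encode("next tone", page2_font), table)

print(same_font_result)
print(own_font_result)
```

With a single shared font, one table is enough for the whole document. With a separate font copy per page, a new table has to be composed by hand for every page, and glyph duplication makes each of those tables larger.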
Preventing the text extraction from PDF document by obfuscating text encoding and fonts of the document.