pdfbox-users mailing list archives

From Parker Seidel <par...@indeed.com>
Subject PDFont.encode throws IllegalArgumentException when encoding text that was just decoded
Date Tue, 15 Dec 2015 00:20:34 GMT
Hello pdfbox users,

I'm working on an internal recruiting application that needs to hide words
in the resumes of candidates who have applied to our engineering positions.
The goal is to hide PII like first names to help eliminate unconscious biases
during the screening process. I'm starting to familiarize myself with
pdfbox and have run into some problems when trying to selectively replace
text in a PDF.

My first approach was to take the *RemoveAllText* example and modify it to
mutate the *COSString* if I detected words that needed to be removed. I
quickly realized that I needed to maintain a PDFont stack.

The first problem I ran into was in *PDFont.encode*. Specifically, if we
decode the text to Unicode using the font and then call PDFont.encode on
the resulting string, I get an IllegalArgumentException:

java.lang.IllegalArgumentException: U+0051 is not available in this font's encoding: built-in (TTF)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.encode(PDTrueTypeFont.java:358)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:285)

I think this error is related to or the same as PDFBOX-3152.

A concrete example (attached) uses the *sample_fonts_solidconverter.pdf*
bundled in pdfbox/test to trigger the unexpected behavior:


PDFont font = getGraphicsState().getTextState().getFont();
COSString string = (COSString) operands.get(0);
String unicode = getUnicode(font, string);
byte[] encoded = font.encode(unicode); // throws the IllegalArgumentException

Could not encode 'V' using Verdana error: No glyph for U+0056 in font Verdana
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+0056 in font Verdana
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.encode(PDCIDFontType2.java:401)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.encode(PDType0Font.java:351)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:285)
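
To illustrate what I suspect is happening, here is a minimal toy sketch. It
assumes that decoding and encoding consult two separate tables and that a
subset font may fill in only the decode side. I have not confirmed that this
is how pdfbox behaves internally. ToyFont, its two maps, and tryEncode are
hypothetical stand-ins for illustration, not pdfbox classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class RoundTripSketch {

    /** Hypothetical stand-in for a subset font; NOT a pdfbox class. */
    public static final class ToyFont {
        // Decode direction: character code -> Unicode text (assumed separate table).
        private final Map<Integer, String> toUnicode = new HashMap<>();
        // Encode direction: Unicode code point -> character code (assumed separate table).
        private final Map<Integer, Integer> cmap = new HashMap<>();

        public void addDecodeEntry(int code, String text) {
            toUnicode.put(code, text);
        }

        public void addEncodeEntry(int codePoint, int code) {
            cmap.put(codePoint, code);
        }

        /** Returns the text for a code, or null if the decode table has no entry. */
        public String decode(int code) {
            return toUnicode.get(code);
        }

        /** Throws IllegalArgumentException when the encode table lacks the code point. */
        public int encode(int codePoint) {
            Integer code = cmap.get(codePoint);
            if (code == null) {
                throw new IllegalArgumentException(
                        String.format("No glyph for U+%04X in toy font", codePoint));
            }
            return code;
        }
    }

    /** Wraps encode so a caller can fall back to the original bytes on failure. */
    public static Optional<Integer> tryEncode(ToyFont font, int codePoint) {
        try {
            return Optional.of(font.encode(codePoint));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    /** Builds a font where 'a' round-trips but 'V' can only be decoded. */
    public static ToyFont sampleFont() {
        ToyFont font = new ToyFont();
        font.addDecodeEntry(0x24, "a");
        font.addEncodeEntry('a', 0x24);
        font.addDecodeEntry(0x39, "V"); // decode entry only; no matching encode entry
        return font;
    }

    public static void main(String[] args) {
        ToyFont font = sampleFont();
        for (int code : new int[] {0x24, 0x39}) {
            String text = font.decode(code);
            Optional<Integer> reencoded = tryEncode(font, text.codePointAt(0));
            System.out.println(text + " re-encodes: " + reencoded.isPresent());
        }
    }
}
```

If the real cause is similar, one possible workaround would be to keep the
original COSString bytes for text that is not being replaced and to call
encode only on the replacement text, catching the exception where the font
has no glyph for a character.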


I'm happy to provide more information or transfer this to a JIRA bug with
more test cases.

Thanks
Parker Seidel
