pdfbox-users mailing list archives

From Zeev Sands <zeev.sa...@gmail.com>
Subject Discrepancy between rendered and extracted characters.
Date Sat, 19 Apr 2014 18:57:10 GMT

I am new to PDFBox and to the PDF format in general, so I apologize if
my questions are uninformed.

I am trying to extract text from a PDF file, and some of the characters
that render correctly on screen (in Acrobat) are extracted incorrectly.
About 99% of the characters are extracted correctly, but in one place,
for example, what appears on screen as a letter X is extracted as '}{',
and in another place two paragraph symbols (¶¶) are extracted as 'iii!'.

After poking around PDFStreamEngine and PDFStreamParser, I can see that
the string rendering as 'X' on the screen comes out of the PDF stream as
<007D007B0020>, which is 00 + '}' + 00 + '{' + 00 + ' ', so that is what
is extracted. Yet on the screen it is clearly an X, with the backward
slash thicker than the forward slash, set in a nice serif font. So as
far as I understand, inside the PDF it *is* '}{ ', but it renders as 'X'
on the screen.
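
To make that byte reading concrete, here is a minimal sketch in plain
Java (no PDFBox calls) of how I am interpreting the hex string. The
class and method names are made up for illustration, and it assumes
that every code is two bytes, big-endian, and maps straight to the
character with the same value. The font's actual encoding may well say
otherwise, which could be exactly where the discrepancy comes from.

```java
public class HexStringDecode {

    // Hypothetical helper: treats the body of a PDF hex string (without
    // the angle brackets) as a run of 2-byte big-endian codes and maps
    // each code to the char with the same numeric value.
    // ASSUMPTION: the code-to-character mapping is the identity; a real
    // font may define a different mapping.
    static String decodeTwoByteCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException(
                    "expected a multiple of 4 hex digits, got " + hex.length());
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hex.length(); i += 4) {
            // Four hex digits make one 2-byte code, e.g. "007D" is 0x007D.
            int code = Integer.parseInt(hex.substring(i, i + 4), 16);
            sb.append((char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make the trailing space visible in the output.
        System.out.println("[" + decodeTwoByteCodes("007D007B0020") + "]");
        // prints "[}{ ]"
    }
}
```

Read this way, the three codes are 0x007D, 0x007B and 0x0020, which is
the '}{ ' that comes out of extraction.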

Is there any way I can get that X, or more importantly those ¶¶? Where
in the PDFBox code can I look to figure it out? Perhaps I am missing a
basic understanding of how character rendering works in PDF. Could
someone please point me in the right direction (references, links,
etc.)?

Thank you,
