pdfbox-users mailing list archives

From Andreas Lehmkühler <andr...@lehmi.de>
Subject Re: Discrepancy between rendered and extracted characters.
Date Tue, 22 Apr 2014 11:01:50 GMT

> Zeev Sands <zeev.sands@gmail.com> wrote on 19 April 2014 at 20:57:
> Hello,
> I am new to pdfbox and the pdf format in general, so I apologize if my
> questions are uninformed.
> I am trying to extract text from a pdf file, and some of the characters
> that render correctly on the screen (in Acrobat) are extracted
> incorrectly. 99% of the characters from the pdf are extracted correctly,
> but in one place, for example, what appears as a letter X on the screen
> is extracted as '}{', and in another place two paragraph symbols (¶¶) are
> extracted as 'iii!'.
> After poking around PDFStreamEngine and PDFStreamParser, I can see that
> the string rendering as 'X' on the screen is coming out of the pdf
> stream as <007D007B0020>, which is 00 + '}' + 00 + '{' + 00 + ' ', so
> that is what is extracted. Yet on the screen it is clearly an X with the
> backward slash thicker than the forward slash, set in a nice serif font.
> So as far as I understand, inside the pdf it *is* '}{ ', but it renders
> as 'X' on the screen.
> Is there any way I can get that X, or more importantly those ¶¶? Where in
> the pdfbox code can I look to figure it out? Perhaps I am missing a
> basic understanding of how character rendering works in pdf. Could
> someone please point me in the right direction (references, links, etc.)?

There are several possible reasons for that behaviour:
- the PDF doesn't provide all the information needed to map the rendered
glyphs to readable text
- the PDF uses a Type3 font, whose glyphs most likely can't be extracted as text
- there is a bug in PDFBox, or a special case that it doesn't support yet

But without having the PDF at hand, these are just guesses.
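
To illustrate the first guess, here is a minimal sketch in plain Java. It is
not PDFBox code. It assumes the font uses 2-byte character codes, as the
string <007D007B0020> suggests, and that text extraction looks each code up
in a code-to-Unicode table supplied by the PDF. The class name, the method
names and the example mapping are all hypothetical; the mapping is invented
for illustration and is not taken from the file in question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CodeToUnicodeSketch {

    // Split the contents of a PDF hex string (without the angle brackets)
    // into 2-byte character codes. Assumes 4 hex digits per code.
    static List<Integer> twoByteCodes(String hex) {
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // Translate character codes to text. If the (hypothetical) mapping has
    // no entry for a code, fall back to treating the code as a Unicode
    // value, which is one way to end up with "wrong" extracted text.
    static String decode(List<Integer> codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            text.append(mapped != null ? mapped : String.valueOf((char) code));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = twoByteCodes("007D007B0020");

        // Without any mapping, the raw codes come out as '}', '{' and ' '.
        System.out.println("[" + decode(codes, Map.of()) + "]");

        // Invented mapping: pretend the two glyphs together draw one 'X'.
        Map<Integer, String> invented = Map.of(0x007D, "X", 0x007B, "");
        System.out.println("[" + decode(codes, invented) + "]");
    }
}
```

With no mapping the sketch falls back to the raw codes, which matches what
you see in the extracted text. The glyphs drawn on screen come from the font
program and are independent of that table.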

First of all, perform the Acrobat test: open the PDF in question in
Acrobat and try to copy and paste the text. What do you get?

> Thank you,
> -ZS

Andreas Lehmkühler
