pdfbox-users mailing list archives

From Zeev Sands <zeev.sa...@gmail.com>
Subject Re: Discrepancy between rendered and extracted characters.
Date Tue, 22 Apr 2014 13:02:23 GMT

When I perform the test you described, the pasted text matches the text 
extracted by pdfbox (both with bad characters).
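
For anyone who wants to repeat this comparison without eyeballing it, here is a minimal sketch. It assumes the pasted text and the extracted text are already available as strings (for example, the extracted one obtained from PDFBox's PDFTextStripper). The class and method names are hypothetical, and the whitespace handling is my own choice, not anything PDFBox does.

```java
// Hypothetical helper, not part of PDFBox: compares the text pasted from
// Acrobat with the text extracted by some other tool.
final class ExtractionCompare {

    /** Collapses every run of whitespace to a single space and trims the ends. */
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    /**
     * Returns the index (in the normalized strings) of the first character
     * that differs, or -1 if the two texts are equal after normalization.
     */
    static int firstMismatch(String pasted, String extracted) {
        String a = normalize(pasted);
        String b = normalize(extracted);
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // One text is a prefix of the other: the mismatch starts where the shorter one ends.
        return a.length() == b.length() ? -1 : n;
    }
}
```

A result of -1 means both tools agree, which in a case like mine suggests the bad characters are stored in the PDF itself rather than introduced by the extractor.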

In this particular case, I now believe that the document was scanned. One 
piece of supporting evidence is that a horizontal line rendered as 
'==================' is extracted as '==~======~===~===' (by both the 
Acrobat test and pdfbox). This seems random, which is consistent with a 
scanned source.
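
The rule-line observation can be turned into a rough check. The sketch below is only a heuristic of my own, with hypothetical names: it reports what fraction of a line's characters differ from the line's most frequent character. A clean rule line scores 0, while OCR-style noise pushes the score above 0.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical heuristic, not part of PDFBox or Acrobat.
final class RuleNoise {

    /**
     * Returns the fraction of characters in the line that differ from its
     * most frequent character. An empty line scores 0.
     */
    static double noiseFraction(String line) {
        if (line.isEmpty()) {
            return 0.0;
        }
        Map<Character, Integer> counts = new HashMap<>();
        int best = 0; // occurrences of the most frequent character so far
        for (char c : line.toCharArray()) {
            int n = counts.merge(c, 1, Integer::sum);
            if (n > best) {
                best = n;
            }
        }
        return (line.length() - best) / (double) line.length();
    }
}
```

This only makes sense for lines that are expected to be a single repeated character. Ordinary prose will always score high, so it says nothing there.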

You make an interesting point: perhaps the font on the scanning end has 
no way of recognizing the character (¶). This character in particular is 
missed consistently: '¶' is extracted as 'ii' and '¶¶' as 'iii!'. This 
gives me some hope that I might be able to do a better job on the 
images in the file by using different OCR software. However, the 'Acrobat 
11.0.6 Paper Capture Plug-in' that presumably produced the document 
sounds serious... hopes diminishing.

Thank you to all for quick responses,

Best Regards,

On 04/22/2014 07:01 AM, Andreas Lehmkühler wrote:
> There are several reasons for that behaviour.
> - the pdf doesn't provide all necessary information to map the rendered text to
> a readable one
> - the pdf uses a Type3 font, which most likely can't be extracted as text
> - there is a bug or a special case isn't yet supported within PDFBox
> But without having the pdf at hand, these are just guesses.
> First of all, perform the Acrobat test. To do so, open the pdf in question
> using Acrobat and try to copy and paste the text. What do you get?
