When I perform the test you described, the pasted text matches the text
extracted by PDFBox (both contain the same bad characters).
In this particular case, I now believe that the document was scanned.
One piece of supporting evidence is that a horizontal line rendered as
'==================' is extracted as '==~======~===~===' (by both the
Acrobat test and PDFBox). The substitutions look random, which is
consistent with scanner errors.
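
A minimal sketch of how such a check could be automated, assuming the
rendered rule is known to consist of a single repeated character. The
class and method names (RuleLineCheck, countStrayChars) are hypothetical
and not part of PDFBox; the input string is the extracted line quoted
above.

```java
public class RuleLineCheck {
    // Counts the characters in an extracted line that differ from the
    // single character the rendered rule is assumed to be made of.
    static int countStrayChars(String extracted, char expected) {
        int stray = 0;
        for (int i = 0; i < extracted.length(); i++) {
            if (extracted.charAt(i) != expected) {
                stray++;
            }
        }
        return stray;
    }

    public static void main(String[] args) {
        // The extracted line from the document in question.
        String extracted = "==~======~===~===";
        int stray = countStrayChars(extracted, '=');
        System.out.println(stray + " of " + extracted.length()
                + " characters differ from '='");
    }
}
```

Stray characters scattered at irregular positions in a line like this
point to recognition errors rather than a systematic font-mapping
problem, which would normally replace every occurrence the same way.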
You make an interesting point: perhaps the font on the scanning end has
no way of recognizing the character (¶). This character in particular is
missed consistently: '¶' is extracted as 'ii' and '¶¶' as 'iii!'. This
gives me some hope that I might be able to do a better job on the
images in the file by using different OCR software. On the other hand,
the 'Acrobat 11.0.6 Paper Capture Plug-in' that presumably produced the
document sounds serious, so my hopes are diminishing.
Thank you all for the quick responses,
Best Regards,
-ZS
On 04/22/2014 07:01 AM, Andreas Lehmkühler wrote:
>
> There are several possible reasons for that behaviour:
> - the pdf doesn't provide all the information necessary to map the rendered
> text to a readable one
> - the pdf uses a Type3 font, which most likely can't be extracted as text
> - there is a bug in PDFBox, or a special case that isn't supported yet
>
> But without having the pdf at hand, these are just guesses.
>
> First of all, perform the Acrobat test. To do so, open the pdf in question
> using Acrobat and try to copy and paste the text. What do you get?