pdfbox-users mailing list archives

From Andreas Lehmkühler <andr...@lehmi.de>
Subject Re: PDF text extraction result different from what they look in PDF reader application
Date Tue, 06 May 2014 09:56:20 GMT
Hi,

> Qingchao Kong <kqingchao@gmail.com> wrote on 5 May 2014 at 12:50:
>
>
> Hi, I am using PDFBox to extract text from PDF files.
> I noticed that, for some PDF files (usually old PDFs), when you select
> some text with the mouse in a PDF reader application (I use Evince
> on Ubuntu), different text shows up than the text you see when
> nothing is selected.
>
> I find that PDFBox sometimes extracts the selected text rather than
> the text shown when nothing is selected. Could anybody tell me why this
> happens? Does my question make sense?
Sounds like a scanned document. Some scanners combine the scanned picture and
the recognized text (produced by more or less accurate OCR software) in one PDF.
The picture is visible and the text is invisible but can still be extracted, so
the displayed content differs from the extracted one.
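
To illustrate the idea: in a PDF content stream, text rendering mode 3 (set
with the `Tr` operator) paints text with neither fill nor stroke, and OCR
layers are commonly hidden behind the scan this way. The following is only a
rough sketch, not PDFBox code. It assumes you already have a decoded
(uncompressed) content stream as a string, uses a deliberately simplified
tokenizer (no hex strings, no nested parentheses, escapes left as-is), and
the function name `invisible_text` and the sample stream are made up for
this example.

```python
import re

# Simplified tokenizer: literal strings, array brackets, everything else.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|\[|\]|[^\s\[\]()]+")
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def is_operand(tok):
    """Strings, array brackets, names and numbers are operands."""
    return tok[0] in "([]/" or NUMBER.fullmatch(tok) is not None


def invisible_text(content):
    """Return the strings shown while text rendering mode is 3 (invisible)."""
    mode = 0        # default rendering mode: fill, i.e. visible text
    saved = []      # modes saved by q and restored by Q
    operands = []   # operands collected since the last operator
    hidden = []
    for tok in TOKEN.findall(content):
        if is_operand(tok):
            operands.append(tok)
            continue
        if tok == "q":
            saved.append(mode)
        elif tok == "Q" and saved:
            mode = saved.pop()
        elif tok == "Tr" and operands:
            mode = int(operands[-1])
        elif tok in ("Tj", "TJ", "'", '"') and mode == 3:
            # Join the literal strings, dropping the enclosing parentheses.
            hidden.append("".join(s[1:-1] for s in operands if s[0] == "("))
        operands = []   # every operator consumes its operands
    return hidden


# Hypothetical page: a scanned image plus an invisible OCR layer,
# followed by one ordinary visible line.
sample = """
q 600 0 0 800 0 0 cm /Im0 Do Q
BT /F1 12 Tf 3 Tr 72 700 Td (scanned) Tj [(te) -20 (xt)] TJ ET
BT /F1 12 Tf 0 Tr 72 650 Td (visible) Tj ET
"""
print(invisible_text(sample))   # ['scanned', 'text']
```

If a check like this finds text on pages that otherwise consist of one large
image, the file is most likely a scan with a hidden OCR layer, and the
extracted text is whatever the OCR software recognized.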

BR
Andreas Lehmkühler
