pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Question on text extraction
Date Mon, 06 Nov 2017 07:09:36 GMT
On 06.11.2017 at 07:04, Jesse James Joson wrote:
> Hi,
>
> I have encountered an issue with text extraction using PDFBox
> 2.0.7. When I open the PDF file in Acrobat I see the content, and it can be
> selected and searched. The specific character "-" is not read correctly:
> when the file is processed by PDFBox, it returns "?" in place of the hyphen.
>
> Thank you
>

Somewhat answered here:

https://pdfbox.apache.org/2.0/faq.html#notext

Another useful read to see how tricky this is:

https://stackoverflow.com/questions/45895768/pdfbox-2-0-7-extracttext-not-working-but-1-8-13-does-and-pdfreader-as-well
https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0

For a specific answer, please link to the PDF. But if Adobe can't 
extract it, then it's unlikely PDFBox can.
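
To narrow down where the "?" comes from, it can help to look at the actual code points of the extracted string rather than at what the console prints. Below is a minimal sketch in Java, assuming the text has already been obtained (for example from PDFTextStripper.getText). The class CodePointDump and the method describe are hypothetical helpers for illustration, not part of PDFBox, and the sample string is a stand-in for real extraction output.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: lists every code point of the text as "U+XXXX".
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Stand-in for extracted text; U+2010 is a typographic hyphen,
        // which is an assumption about what the PDF might contain.
        String extracted = "a\u2010b";
        System.out.println(describe(extracted));
    }
}
```

If the dump shows U+003F, the question mark is really in the extracted text, which points at the font mapping in the PDF. If it shows another code point (U+2010 or U+00AD, say), the extraction may be fine and the "?" is probably introduced later, for instance by the console or file encoding.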

Tilman




