pdfbox-users mailing list archives

From Hannes Carl Meyer <hannesc...@googlemail.com>
Subject Text Extraction and Fonts
Date Sat, 29 Jan 2011 21:24:21 GMT

I'm using PDFBox to extract text from various PDFs.
Since these PDFs are from good ol' Germany and written in German, they
contain lots of nice umlauts (ä, ö, ü, etc.).

For some PDFs, the extraction of umlauts fails.

From my first analysis, I suspect this happens because I don't have the
particular font used by the PDF installed.

Is it necessary to have a font installed and loaded into PDFBox to perform a
proper extraction?

Another interesting point: if I open one of the affected PDF documents in
Adobe Reader and search for an umlaut that is displayed properly, the search
fails. Manually extracting the text via copy & paste from the PDF fails
as well.
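
To tell affected documents apart from good ones, a check along the following lines might help. It is only a rough sketch and works on text that has already been extracted (for example, the string returned by PDFBox's PDFTextStripper). The class name `UmlautCheck` and the method `suspiciousCodePoints` are made up for illustration. It assumes that a failed extraction shows up as the Unicode replacement character, private-use code points, or stray control characters. A PDF that maps umlauts to wrong but ordinary letters would not be caught.

```java
import java.util.ArrayList;
import java.util.List;

public class UmlautCheck {

    /**
     * Returns the code points in the extracted text that look like failed
     * character mappings. This is a heuristic, not a definitive test.
     */
    static List<Integer> suspiciousCodePoints(String text) {
        List<Integer> hits = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            // U+FFFD is the Unicode replacement character.
            boolean replacement = cp == 0xFFFD;
            // Private-use code points carry no standard meaning.
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            // Control characters other than common whitespace are unexpected.
            boolean control = Character.isISOControl(cp)
                    && cp != '\n' && cp != '\r' && cp != '\t' && cp != '\f';
            if (replacement || privateUse || control) {
                hits.add(cp);
            }
        });
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample strings; in practice, pass in the extracted text.
        String good = "Gr\u00FC\u00DFe aus K\u00F6ln";
        String bad = "Gr\uFFFD\uE001e aus K\u0001ln";

        System.out.println("good: " + suspiciousCodePoints(good).size() + " suspicious");
        for (int cp : suspiciousCodePoints(bad)) {
            System.out.println("bad: U+" + String.format("%04X", cp));
        }
    }
}
```

An empty result does not prove the extraction is correct. It only rules out the most obvious symptoms.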

Thanks & Regards

