pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Text Extraction and Fonts
Date Sun, 30 Jan 2011 15:18:12 GMT
Hi,

On 29.01.2011 22:24, Hannes Carl Meyer wrote:
> Hi,
>
> I'm using PDFBox to extract text from various PDFs.
> Since these PDFs are from good ol' Germany and written in German, they contain
> lots of nice umlauts (ä, ö, ü etc.).
>
> On some PDFs the extraction of umlauts fails.
>
> From my first analysis, I suspect it is because I don't have the font used by
> the particular PDF.
>
> Is it necessary to have a font installed and loaded into PDFBox to perform a
> proper extraction?
>
> Another interesting point: if I open the PDF documents from which I can't
> extract umlauts in Adobe Reader and search for an umlaut that is displayed
> properly, the search fails. Manually extracting the text from the PDF via
> copy & paste fails as well.
Without having the PDF at hand, it's hard to say what the reason for the
described issue may be. There are several possibilities:

1.) the font isn't embedded and the substitution made by PDFBox doesn't fit 100%
2.) the font is an embedded subset of a TrueType font, which will be
substituted with another font due to an issue concerning font subsets (see [1]
for further info), and that may lead to the same effect as in 1.
3.) the PDF uses so-called CIDs (character IDs) without a suitable mapping to
Unicode
4.) the PDF uses a Type 3 font without a suitable mapping to Unicode
5.) you're using the wrong parameters for the extraction
6.) you're using an editor with limited capabilities concerning text encoding
7.) there is still an issue with PDFBox
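
The encoding-related cases (wrong extraction parameters, or an editor that
assumes a different encoding) are the easiest to rule out on your side. The
following is only a rough sketch using plain Java, not PDFBox itself; the class
and method names are made up for illustration. It shows how umlauts get garbled
as soon as text is written in one encoding and read back in another:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of PDFBox.
class UmlautEncodingCheck {

    // Encode the text with one charset and decode the resulting bytes with
    // another. This mimics writing the extracted text to a file in one
    // encoding and opening it in an editor that assumes a different one.
    public static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] bytes = text.getBytes(writeWith);
        return new String(bytes, readWith);
    }

    public static void main(String[] args) {
        // "äöü" written with Unicode escapes, so the result does not depend
        // on the encoding of this source file. It stands in for extracted text.
        String extracted = "\u00e4\u00f6\u00fc";

        // Same encoding on both sides: the umlauts survive.
        String same = roundTrip(extracted,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 -> UTF-8 intact: " + same.equals(extracted));

        // Written as UTF-8 but read as ISO-8859-1: every umlaut turns into
        // two unrelated characters.
        String mixed = roundTrip(extracted,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 -> ISO-8859-1 intact: " + mixed.equals(extracted));

        // Written as ISO-8859-1 but read as UTF-8: the bytes are not valid
        // UTF-8, so the decoder substitutes replacement characters (U+FFFD).
        String broken = roundTrip(extracted,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 -> UTF-8 has U+FFFD: "
                + broken.contains("\uFFFD"));
    }
}
```

If the umlauts are already wrong before the text is written anywhere (for
example when inspecting the extracted String in a debugger), the encoding of
the output is not the culprit and the cause is more likely in the PDF itself.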

Given your last observation (searching and copy & paste fail in Adobe Reader
as well), cases 3 or 4 are the most likely.

BTW, what version of PDFBox are you using?

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-490
