pdfbox-users mailing list archives

From Hannes Carl Meyer <hannesc...@googlemail.com>
Subject Re: Text Extraction and Fonts
Date Sun, 30 Jan 2011 16:20:39 GMT
Hi Andreas,

thank you very much for your reply!

The problem occurs for example on this document

I'm using the latest version of PDFBox, 1.4.0!

Do you know a tool to debug a given PDF? Maybe you could take a look at the
PDF mentioned above.



On Sun, Jan 30, 2011 at 4:18 PM, Andreas Lehmkuehler <andreas@lehmi.de> wrote:

> Hi,
> On 29.01.2011 22:24, Hannes Carl Meyer wrote:
>> Hi,
>> I'm using PDFBox to extract text from various PDFs.
>> Since these PDFs are from good ol' Germany and in German, they contain
>> lots of nice umlauts (ä, ö, ü etc.).
>> On some PDFs the extraction of umlauts fails.
>> From my first analysis I could imagine it is somehow because I don't
>> have the particular PDF's font.
>> Is it necessary to have a font installed and loaded into PDFBox to perform
>> a proper extraction?
>> Another interesting point: if I open the PDF documents from which I can't
>> extract umlauts in Adobe Reader and search for an umlaut that is
>> displayed properly, the search fails. Manually extracting the text from
>> the PDF via copy & paste fails as well.
> Without having the PDF at hand, it's hard to say what the reason for the
> described issue may be. There are different possibilities:
> 1.) the font isn't embedded and the substitution made by PDFBox doesn't fit
> 100%
> 2.) the font is an embedded subset of a TrueType font, which will be
> substituted with another font due to an issue concerning font subsets (see
> [1] for further info), and that may lead to the same effect as 1.
> 3.) the PDF uses so-called CIDs (character IDs) without a suitable mapping
> to Unicode
> 4.) the PDF uses a Type 3 font without a suitable mapping to Unicode
> 5.) you're using wrong parameters for the extraction
> 6.) you're using an editor with limited capabilities concerning text
> encoding
> 7.) there is still an issue with PDFBox
> Given your last comment, cases 3 or 4 are most likely.
> BTW, what version of PDFBox are you using?
> BR
> Andreas Lehmkühler
> [1] https://issues.apache.org/jira/browse/PDFBOX-490
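
The two encoding-related possibilities in the quoted list (wrong parameters for the extraction, or an editor with limited text-encoding support) can be checked without touching the PDF at all. The following is only a minimal sketch using the Java standard library, not PDFBox code: the class name, the method names and the sample string ("Grüße aus Köln", written with Unicode escapes) are made up for illustration. It shows that text containing umlauts survives writing and re-reading in UTF-8 or ISO-8859-1, but not in US-ASCII, where each umlaut is replaced by "?".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UmlautEncodingCheck {

    // Encodes the text with the given charset and decodes it again,
    // mimicking "write the extracted text to a file, then open it in an editor".
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    // True if the text is unchanged after the round trip.
    public static boolean survives(String text, Charset charset) {
        return text.equals(roundTrip(text, charset));
    }

    public static void main(String[] args) {
        // Hypothetical sample text; Unicode escapes keep this source file ASCII-only.
        String sample = "Gr\u00fc\u00dfe aus K\u00f6ln";

        System.out.println("UTF-8 keeps umlauts:      " + survives(sample, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 keeps umlauts: " + survives(sample, StandardCharsets.ISO_8859_1));
        System.out.println("US-ASCII keeps umlauts:   " + survives(sample, StandardCharsets.US_ASCII));
        System.out.println("US-ASCII result:          " + roundTrip(sample, StandardCharsets.US_ASCII));
    }
}
```

If the umlauts are already missing from the extracted string before it is written anywhere, the output encoding is not the culprit and the font-related possibilities remain.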
