pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Text Extraction and Fonts
Date Sun, 30 Jan 2011 17:31:38 GMT

On 30.01.2011 17:20, Hannes Carl Meyer wrote:
> Hi Andreas,
> thank you very much for your reply!
> The problem occurs for example on this document
> https://www.sparkasse-hildesheim.de/pdf/vertragsbedingungen/057_produktbedingungen_spk_cards.pdf
> I'm using the latest version of PDFBox, 1.4.0!
Hmm, I can confirm your issue. It seems to be case 7, i.e. the second "6.)" in
my list ;-) It works fine with the current trunk (we recently made some
improvements).

> Do you know a tool to debug a given PDF? Maybe you could have a look at the
> PDF shown above.
To determine which fonts are used, just have a look at the PDF's document
properties. Adobe Reader and other tools show them.
To walk through a PDF on a logical level, use the PDFDebugger [1], which comes
with PDFBox.

[1] http://pdfbox.apache.org/commandlineutilities/PDFDebugger.html

> On Sun, Jan 30, 2011 at 4:18 PM, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>> Hi,
>> On 29.01.2011 22:24, Hannes Carl Meyer wrote:
>>> Hi,
>>> I'm using PDFBox to extract text from various PDFs.
>>> Since these PDFs are from good ol' Germany and written in German, they
>>> contain lots of nice umlauts (ä, ö, ü etc.).
>>> On some PDFs the extraction of umlauts fails.
>>> From my first analysis, I suspect it is because I don't have the
>>> particular PDF's font.
>>> Is it necessary to have a font installed and loaded into PDFBox to perform
>>> a proper extraction?
>>> Another interesting point: if I open the PDF documents I can't extract
>>> umlauts from in Adobe Reader and search for an umlaut that is displayed
>>> properly, the search fails. Manually extracting the text from the PDF
>>> via copy & paste fails as well.
>> Without having the PDF at hand, it's hard to say what the reason for the
>> described issue may be. There are different possibilities:
>> 1.) the font isn't embedded and the substitution made by PDFBox doesn't fit
>> 100%
>> 2.) the font is an embedded subset of a TrueType font, which will be
>> substituted with another font due to an issue concerning font subsets (see
>> [1] for further info), and that may lead to the same effect as 1.
>> 3.) the PDF uses so-called CIDs (character IDs) without a suitable mapping
>> to Unicode
>> 4.) the PDF uses a Type 3 font without a suitable mapping to Unicode
>> 5.) you're using wrong parameters for the extraction
>> 6.) you're using an editor with limited capabilities concerning text
>> encoding
>> 6.) there is still an issue with PDFBox
>> Judging by your last remark, cases 3. or 4. are the most likely.
>> BTW, what version of PDFBox are you using?
>> BR
>> Andreas Lehmkühler
>> [1] https://issues.apache.org/jira/browse/PDFBOX-490

Andreas Lehmkühler
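
Two of the causes listed above can be checked on the extracted text alone,
without touching the PDF. The following is a minimal sketch in plain Java, not
part of PDFBox; the class and method names are hypothetical. The first helper
reproduces the encoding problem from the first case 6.) by writing text in one
charset and reading it in another. The second helper counts characters that
often show up when a font has no usable mapping to Unicode (cases 3. and 4.).
Treating U+FFFD and the Private Use Area as "suspicious" is an assumption of
this sketch, not documented PDFBox behaviour.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class UmlautCheck {

    // Encode the text with one charset and decode it with another, which
    // mimics an editor or viewer that assumes the wrong encoding.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    // Count characters that may hint at a missing Unicode mapping:
    // U+FFFD (replacement character) and the BMP Private Use Area.
    static int countSuspicious(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\uFFFD' || (c >= '\uE000' && c <= '\uF8FF')) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "Gebühren für Überweisungen", written with Unicode escapes so the
        // source file compiles the same way under any source encoding.
        String extracted = "Geb\u00FChren f\u00FCr \u00DCberweisungen";

        String misread = reinterpret(extracted,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("survives wrong charset: " + extracted.equals(misread));
        System.out.println("suspicious characters:  " + countSuspicious(extracted));
    }
}
```

If the text only breaks after it has been written to a file or opened in an
editor, the encoding used for output is the first thing to check. If the
string is already wrong right after extraction, the font cases are the more
likely cause.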
