Mailing-List: contact users-help@pdfbox.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@pdfbox.apache.org
Received-SPF: pass (nike.apache.org: domain of hannescarl@googlemail.com
 designates 209.85.210.48 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=googlemail.com; s=gamma;
        h=mime-version:reply-to:in-reply-to:references:date:message-id
         :subject:from:to:content-type;
        b=gD5QrFDr5PS0mydV2Tr8E1w8ntl4Ys8johu9RjlNfiH+6GJuLZXusSyuOF5rrAdXxc
         Pu078DIVQCiqfKQ3mCLRTq1IwlbTK+miH4Tu8kwqx5UTt5eLdbIGwD+0QN7uYOWPDDPZ
         Sd1CVc+0PSw8uePZfMXkUKYCGN7W/1hjGSkgk=
MIME-Version: 1.0
Reply-To: hannescarl@googlemail.com
In-Reply-To: <4D45A07A.6040109@lehmi.de>
References: <AANLkTikypzxrxY+F7g-k90np2yu_r=WW7h1KzTPj=jvb@mail.gmail.com>
	<4D458134.4050309@lehmi.de>
	<AANLkTikiGmSvA9z5H5yFE9nJkBV8Tv7R069KjOZ5Qdif@mail.gmail.com>
	<4D45A07A.6040109@lehmi.de>
Date: Sun, 30 Jan 2011 18:56:37 +0100
Message-ID: <AANLkTimp2ZtFCmLA5aDs54TarW47vMmbOBC=6OSWgLGE@mail.gmail.com>
Subject: Re: Text Extraction and Fonts
From: Hannes Carl Meyer <hannescarl@googlemail.com>
To: users@pdfbox.apache.org
Content-Type: multipart/alternative; boundary=000e0cd14b885437ec049b140577

--000e0cd14b885437ec049b140577
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi Andreas,

great help, I'm going to check the version on the Trunk!

Regards

Hannes

On Sun, Jan 30, 2011 at 6:31 PM, Andreas Lehmkuehler <andreas@lehmi.de> wrote:

> Hi,
>
>
> On 30.01.2011 17:20, Hannes Carl Meyer wrote:
>
>  Hi Andreas,
>>
>> thank you very much for your reply!
>>
>> The problem occurs for example on this document
>>
>> https://www.sparkasse-hildesheim.de/pdf/vertragsbedingungen/057_produktbedingungen_spk_cards.pdf
>>
>> I'm using the latest version of PDFBox, 1.4.0!
>>
> Hmm, I can confirm your issue, and it seems to be case 7 (the second item
> numbered 6.) ;-) It works fine with the current trunk (we recently made
> some improvements).
>
>
>> Do you know a tool to debug a given PDF? Maybe you could have a look at
>> the PDF shown above.
>>
> To determine which fonts are used, just have a look at the PDF properties.
> Acrobat Reader and other tools show them.
> Use the PDFDebugger [1], which comes with PDFBox, to walk through a PDF on a
> logical level.
>
>
> [1] http://pdfbox.apache.org/commandlineutilities/PDFDebugger.html
>
>
>> On Sun, Jan 30, 2011 at 4:18 PM, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>>
>>  Hi,
>>>
>>> On 29.01.2011 22:24, Hannes Carl Meyer wrote:
>>>
>>>  Hi,
>>>
>>>>
>>>> I'm using PDFBox to extract text from various PDFs.
>>>> Since these PDFs are from good ol' Germany and written in German, they
>>>> contain lots of nice umlauts (ä, ö, ü etc.).
>>>>
>>>> On some PDFs the extraction of umlauts fails.
>>>>
>>>> From my first analysis, I suspect this happens because I don't have
>>>> the particular PDF's font installed.
>>>>
>>>> Is it necessary to have a font installed and loaded into PDFBox to
>>>> perform a proper extraction?
>>>>
>>>> Another interesting point: if I open the PDF documents that I can't
>>>> extract umlauts from in Adobe Reader and search for an umlaut that is
>>>> displayed properly, the search fails. Manually extracting the text
>>>> via copy & paste from the PDF fails as well.
>>>>
>>> Without having the PDF at hand, it's hard to say what the reason
>>> for the described issue may be. There are several possibilities:
>>>
>>> 1.) the font isn't embedded and the substitution made by PDFBox doesn't
>>> fit 100%
>>> 2.) the font is an embedded subset of a TrueType font, which will be
>>> substituted with another font due to an issue concerning font subsets
>>> (see [1] for further info), and that may lead to the same effect as 1.
>>> 3.) the PDF uses so-called CIDs (character IDs) without a suitable
>>> mapping to Unicode
>>> 4.) the PDF uses a Type 3 font without a suitable mapping to Unicode
>>> 5.) you're using wrong parameters for the extraction
>>> 6.) you're using an editor with limited capabilities concerning text
>>> encoding
>>> 6.) there is still an issue with PDFBox
>>>
>>> Given your last comment, cases 3. or 4. are the most likely.
>>>
>>> BTW, what version of PDFBox are you using?
>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>> [1] https://issues.apache.org/jira/browse/PDFBOX-490
>>>
>>
> BR
> Andreas Lehmkühler
>
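To make cases 3. and 4. from the list above a bit more concrete, here is a toy sketch in plain Java. It is not PDFBox code and does not read a real PDF. The class ToUnicodeSketch, the extract method and the character codes are all made up for illustration. The idea is only that a PDF can draw a glyph correctly while storing no mapping from its character code to Unicode, and that any extractor (or a viewer's search and copy & paste) then has nothing to go on for that character.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a hypothetical stand-in for a font's code-to-Unicode table.
class ToUnicodeSketch {

    // Translates character codes to text; codes without a mapping become U+FFFD.
    static String extract(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String text = toUnicode.get(code);
            // No mapping: the glyph may still render, but its meaning is unknown.
            out.append(text != null ? text : "\uFFFD");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "G");
        toUnicode.put(2, "r");
        toUnicode.put(4, "n");
        // Code 3 is assumed to draw the u-umlaut glyph, but it has no entry here.
        int[] codes = {1, 2, 3, 4};

        // The umlaut comes out as the replacement character U+FFFD.
        System.out.println(extract(codes, toUnicode));

        // With the missing entry added, the same codes extract cleanly.
        toUnicode.put(3, "\u00FC");
        System.out.println(extract(codes, toUnicode));
    }
}
```

If this is roughly what happens in the document, installing the font locally would not help, because the missing piece is the mapping inside the PDF rather than the glyph shapes.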

--000e0cd14b885437ec049b140577--