From: Hannes Carl Meyer
Reply-To: hannescarl@googlemail.com
To: users@pdfbox.apache.org
Subject: Re: Text Extraction and Fonts
Date: Sun, 30 Jan 2011 18:56:37 +0100
In-Reply-To: <4D45A07A.6040109@lehmi.de>
References: <4D458134.4050309@lehmi.de> <4D45A07A.6040109@lehmi.de>

Hi Andreas,

great help, I'm going to check the version on the trunk!

Regards

Hannes

On Sun, Jan 30, 2011 at 6:31 PM, Andreas Lehmkuehler wrote:

> Hi,
>
> On 30.01.2011 17:20, Hannes Carl Meyer wrote:
>
>> Hi Andreas,
>>
>> thank you very much for your reply!
>>
>> The problem occurs, for example, with this document:
>>
>> https://www.sparkasse-hildesheim.de/pdf/vertragsbedingungen/057_produktbedingungen_spk_cards.pdf
>>
>> I'm using the latest version of PDFBox, 1.4.0!
>
> Hmm, I can confirm your issue and it seems to be case 7., the second case
> 6.;-) It works fine with the current trunk (we recently made some
> improvements).
>
>> Do you know a tool to debug a given PDF? Maybe you could have a look at
>> the PDF shown above.
>
> To determine which fonts are used, just have a look at the PDF properties.
> Acrobat Reader and other tools provide those properties.
> Use the PDFDebugger [1], which comes with PDFBox, to walk through a PDF on a
> logical level.
>
> [1] http://pdfbox.apache.org/commandlineutilities/PDFDebugger.html
>
>> On Sun, Jan 30, 2011 at 4:18 PM, Andreas Lehmkuehler wrote:
>>
>>> Hi,
>>>
>>> On 29.01.2011 22:24, Hannes Carl Meyer wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm using PDFBox to extract text from various PDFs.
>>>> Since these PDFs are from good ol' Germany and written in German, they
>>>> contain lots of nice umlauts (ä, ö, ü etc.).
>>>>
>>>> On some PDFs the extraction of umlauts fails.
>>>>
>>>> From my first analysis, I suspect the cause is that I don't have the
>>>> particular PDF's font.
>>>>
>>>> Is it necessary to have a font installed and loaded into PDFBox to
>>>> perform a proper extraction?
>>>>
>>>> Another interesting point: if I open one of the PDF documents that I
>>>> can't extract umlauts from in Adobe Reader and search for an umlaut
>>>> that is displayed properly, the search fails. Manually extracting the
>>>> text from the PDF via copy & paste fails as well.
>>>
>>> Without having the PDF at hand, it's hard to say what the reason for
>>> the described issue may be. There are different possibilities:
>>>
>>> 1.) the font isn't embedded and the substitution made by PDFBox doesn't
>>> fit 100%
>>> 2.) the font is an embedded subset of a TrueType font, which will be
>>> substituted with another font due to an issue concerning font subsets
>>> (see [1] for further info), and that may lead to the same effect as 1.
>>> 3.) the PDF uses so-called CIDs (character IDs) without a suitable
>>> mapping to Unicode
>>> 4.) the PDF uses a Type 3 font without a suitable mapping to Unicode
>>> 5.) you're using wrong parameters for the extraction
>>> 6.) you're using an editor with limited capabilities concerning text
>>> encoding
>>> 6.) there is still an issue with PDFBox
>>>
>>> Following your last comment, cases 3. or 4. are most likely.
>>>
>>> BTW, what version of PDFBox are you using?
>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>> [1] https://issues.apache.org/jira/browse/PDFBOX-490
>
> BR
> Andreas Lehmkühler
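
Cases 3.) and 4.) from the quoted list can be illustrated with a small toy model. The sketch below is not PDFBox code and does not use any PDFBox API: the class `ToUnicodeSketch`, the method `decode`, and the numeric character codes are all hypothetical, chosen only to show the idea. It assumes that text extraction boils down to looking up each character code in a code-to-Unicode table, and that a code with no entry cannot be turned into text even though its glyph may still render correctly.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a hypothetical stand-in for a font's code-to-Unicode table.
public class ToUnicodeSketch {

    // Translate character codes to text; codes with no mapping become "?".
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            sb.append(mapped != null ? mapped : "?");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical codes for the German word "für".
        int[] codes = {1, 2, 3};

        // A complete table maps every code, including the umlaut.
        Map<Integer, String> complete = new HashMap<>();
        complete.put(1, "f");
        complete.put(2, "\u00fc"); // ü
        complete.put(3, "r");

        // An incomplete table lacks the entry for the umlaut.
        Map<Integer, String> incomplete = new HashMap<>(complete);
        incomplete.remove(2);

        System.out.println(decode(codes, complete));
        System.out.println(decode(codes, incomplete));
    }
}
```

In this model the second call loses the umlaut while the other letters survive. That would also be consistent with search and copy & paste failing in Adobe Reader, since any tool has to rely on the same kind of mapping to get from character codes to text.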