From: Andreas Lehmkühler
To: users@pdfbox.apache.org
Date: Mon, 10 Aug 2015 12:18:28 +0200 (CEST)
Subject: Re: Major differences between PDFTextStripper and PrintTextLocations

Hi Gilad,

sorry for the late answer ...

I'm not sure what you're expecting. You are using two totally different
approaches to process a PDF. PrintTextLocations provides a lot of additional
information for every piece of text, which may vary from one character up to
whole words or lines of text. Consequently, the output has to be totally
different and of course much bigger than the output of a simple text
extraction.

BR
Andreas

> On 10 August 2015 at 10:05, Gilad Denneboom wrote:
>
>
> No one has any ideas?
>
> On Thu, Aug 6, 2015 at 5:49 PM, Gilad Denneboom wrote:
>
> > Hi everyone,
> >
> > I'm looking for advice on a problem I'm encountering where the output of
> > PDFTextStripper and PrintTextLocations is dramatically different when
> > processing the same file.
> > For some reason, the output of PrintTextLocations is 12 times longer than
> > that of PDFTextStripper, ie the entire text is printed out 12 times,
> > instead of just once.
> >
> > I'm attaching the file in question, as well as the output produced using
> > both methods via Google Drive... Hopefully it will come through.
> >
> > I'd appreciate any ideas as to what might be causing this issue (I'm
> > guessing there's something wrong with the structure of the file), and of
> > course any possible solutions.
> >
> > Thanks in advance, Gilad.
> >
> > PS. I'm using 1.8.10.
> >
> > output problem.zip
> >
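
The size difference described in the reply can be illustrated with a small, self-contained sketch. It does not use PDFBox at all: `Piece`, `layOut`, `plainText` and `locationReport` are hypothetical names made up for this illustration, and the fixed character advance is an assumption. It only shows why a report that prints one line of position data per piece of text is necessarily much longer than the bare text, and it does not attempt to explain or reproduce the twelvefold repetition reported in the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ExtractionOutputSketch {

    /** Hypothetical stand-in for one positioned piece of text; not a PDFBox class. */
    static final class Piece {
        final String text;
        final float x;
        final float y;
        final float fontSize;

        Piece(String text, float x, float y, float fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** Splits a line into one piece per character, using an assumed fixed advance. */
    static List<Piece> layOut(String line, float startX, float y, float fontSize) {
        List<Piece> pieces = new ArrayList<>();
        float x = startX;
        for (int i = 0; i < line.length(); i++) {
            pieces.add(new Piece(String.valueOf(line.charAt(i)), x, y, fontSize));
            x += fontSize * 0.5f; // assumption: every character is half an em wide
        }
        return pieces;
    }

    /** Simple text extraction: keeps only the text of each piece. */
    static String plainText(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            sb.append(p.text);
        }
        return sb.toString();
    }

    /** Location report: one output line per piece, carrying its position and font size. */
    static String locationReport(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            sb.append(String.format(Locale.ROOT,
                    "String[x=%.1f, y=%.1f, fontSize=%.1f] %s%n",
                    p.x, p.y, p.fontSize, p.text));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Piece> pieces = layOut("Hello PDFBox", 72f, 700f, 12f);
        String plain = plainText(pieces);
        String report = locationReport(pieces);
        System.out.println(plain);
        System.out.print(report);
        System.out.println("plain chars: " + plain.length()
                + ", report chars: " + report.length());
    }
}
```

Comparing the two character counts printed at the end gives a rough feel for how the per-piece metadata dominates the output size, even when each piece is a single character.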