From: Andreas Lehmkühler (JIRA)
To: pdfbox-dev@incubator.apache.org
Date: Tue, 18 Aug 2009 09:23:14 -0700 (PDT)
Subject: [jira] Updated: (PDFBOX-234) spaces lost
Message-ID: <1579235307.1250612594991.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/PDFBOX-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-234:
--------------------------------------

    Fix Version/s: 0.8.0-incubator

> spaces lost
> -----------
>
>                 Key: PDFBOX-234
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-234
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Priority: Minor
>             Fix For: 0.8.0-incubator
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1635950
> Originally submitted by tweakerbee on 2007-01-15 07:09.
> During extraction from certain PDF documents, spaces are lost. I have attached a file in which this problem occurs.
> Here PDFTextStripper.getText() returns:
> gaandeofincidenteleaardis
> whereas it should be
> gaande of incidentele aard is
> I have used the nightly build from today (15-01-07), but the problem remains.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1635950&file_id=211376
> STB336.pdf (application/pdf), 51425 bytes
> document with erroneous text extraction
> [comment on SourceForge]
> Originally sent by tweakerbee.
> Logged In: YES
> user_id=1625706
> Originator: YES
> The problem turned out to be in the splitting algorithm. The values there turned out to be slightly too conservative.
> Using 0.33f (33%) yielded proper results. This might split words that are not meant to be split, however.
> Maybe you could make this configurable through a field in the TextStripper? That would make it slightly easier to adjust an application to its specific needs.
> This issue can be considered solved.
> startOfNextWordX = endOfLastTextX + (wordSpacing * 0.33f);
> startOfNextWordX = endOfLastTextX + (((wordSpacing + lastWordSpacing) / 2f) * 0.33f);
> [comment on SourceForge]
> Originally sent by tweakerbee.
> Logged In: YES
> user_id=1625706
> Originator: YES
> My previous assumption turned out to be incorrect.
> The context.showString() function is responsible for outputting the string. If anywhere, it should probably output the space here.
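
The 0.33f threshold from the splitting-algorithm comment above can be sketched as a small, configurable check. This is only an illustration under assumptions: the class WordGapSketch, its setter, and the rule "a glyph at or beyond startOfNextWordX starts a new word" are hypothetical and are not taken from the PDFBox source; only the formula itself comes from the comment.

```java
// Hypothetical helper, not part of PDFBox: shows the averaged word-spacing
// threshold from the comment with the 0.33f factor exposed as a field.
class WordGapSketch {
    // Fraction of the averaged word spacing that counts as a word gap.
    private float spacingFactor = 0.33f;

    void setSpacingFactor(float spacingFactor) {
        this.spacingFactor = spacingFactor;
    }

    // Returns true when the next glyph starts at or beyond the computed
    // startOfNextWordX (assumed comparison; the comment does not show it).
    boolean startsNewWord(float endOfLastTextX, float wordSpacing,
                          float lastWordSpacing, float nextGlyphX) {
        float averageSpacing = (wordSpacing + lastWordSpacing) / 2f;
        float startOfNextWordX = endOfLastTextX + averageSpacing * spacingFactor;
        return nextGlyphX >= startOfNextWordX;
    }
}
```

A smaller factor inserts spaces more eagerly and may split words that belong together; a larger factor does the opposite, which is the trade-off the comment mentions.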
> [comment on SourceForge]
> Originally sent by tweakerbee.
> Logged In: YES
> user_id=1625706
> Originator: YES
> I am currently looking into the problem myself as well, but my complete lack of experience with the Portable Document Format and my being a novice Java programmer are rather limiting.
> What I have found out so far is this:
> The problem is in the text stream, where a TJ operator is used to show the glyphs. There are no spaces encoded in the file; instead it uses character spacing information to space out the words. An example is included below.
> The code I believe is responsible for extracting the text here (org.pdfbox.util.operator.ShowTextGlyph) does not contain any code to determine whether or not a space is needed. Would it be useful to add this here? And will this not break org.pdfbox.util.PDFHighlighter? (I have noticed some difficulties with certain PDF documents, and I wouldn't be surprised if the difference in character count originates from this issue.)
> Any help would be greatly appreciated.
> Example code in STB336.pdf:
> [(7?????)-278(???)-278(? ?"&????)-278(???)-278(???)-278( ??\))-278(???)-278(??????\)????\012)-278( ?????'??&)-278(??)]TJ

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
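
To illustrate the TJ behaviour described in the report: in a TJ array, each number is a position adjustment in thousandths of a text space unit, and a negative value moves the next string further along, so -278 opens a gap about as wide as a space. The sketch below turns sufficiently large negative adjustments into spaces. The class TjSpacingSketch, the -200 threshold, and the decoded strings are assumptions made for illustration; this is not the PDFBox implementation, and the real strings in STB336.pdf use a font-specific encoding.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not PDFBox code: walks a simplified TJ array in which
// strings are already decoded and numbers are the kerning adjustments.
class TjSpacingSketch {
    // Assumed cut-off: adjustments at or below this value count as a word gap.
    static final float DEFAULT_THRESHOLD = -200f;

    static String extract(List<Object> tjArray, float threshold) {
        StringBuilder out = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                out.append((String) element);
            } else if (element instanceof Number) {
                float adjustment = ((Number) element).floatValue();
                boolean endsWithSpace =
                        out.length() > 0 && out.charAt(out.length() - 1) == ' ';
                // Large negative adjustment: treat the gap as a space,
                // but never emit a leading or doubled space.
                if (adjustment <= threshold && out.length() > 0 && !endsWithSpace) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative input modelled on the expected output in the report.
        List<Object> tj = Arrays.<Object>asList(
                "gaande", -278, "of", -278, "incidentele", -278, "aard", -278, "is");
        System.out.println(extract(tj, DEFAULT_THRESHOLD));
    }
}
```

With a threshold stricter than the adjustments in the array (for example -500), no spaces are inserted and the words run together, which matches the symptom reported in this issue.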