Message-ID: <1874544733.1250627414824.JavaMail.jira@brutus>
Date: Tue, 18 Aug 2009 13:30:14 -0700 (PDT)
From: "Dmitry Gutso (JIRA)"
To: pdfbox-dev@incubator.apache.org
Reply-To: pdfbox-dev@incubator.apache.org
Subject: [jira] Issue Comment Edited: (PDFBOX-234) spaces lost

    [ https://issues.apache.org/jira/browse/PDFBOX-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744517#action_12744517 ]

Dmitry Gutso edited comment on PDFBOX-234 at 8/18/09 1:28 PM:
--------------------------------------------------------------

I posted my question here because it seems like the same problem.

I have a problem with PDFTextStripper adding the space character between words in the output text. I used PDFBox-0.7.3 with Java 1.6.0.

Example (the string was copied through Adobe's text buffer from the indicated PDF file):

source pdf: http://www.cmegroup.com/daily_bulletin/Section02A_Summary_Volume_And_Open_Interest_(Excludes%20TRAKRS)_Comm_Alt_Invest_Futures_And_Options_2009152.pdf

IOM DIVISION 1523788 456934 1980722 23017621 + 105674 3147531 27408756

after parsing:

IOM DIVISION 1523788 4569341980722 23017621 + 105674 3147531 27408756

cause:

String[229.68,212.15997 fs=1.0 xscale=6.0 height=3.7500005 width=14.412003]5693
String[244.08,212.15997 fs=1.0 xscale=6.0 height=3.7500005 width=85.805984]41
String[290.58,212.15997 fs=1.0 xscale=6.0 height=3.7500005 width=10.8089905]980

I tried using the classes from PDFBox-0.7.4.jar, but without success. The code that was checked in for the PDFBOX-349 issue used "0.7.4", but this does not work for me... Sorry.

      was (Author: gtsdmtry):

I posted my question here because it seems like the same problem.

I have a problem with PDFTextStripper adding the space character between words in the output text. I used PDFBox-0.7.3 with Java 1.6.0.

Example (the string was copied through Adobe's text buffer from the indicated PDF file):

source pdf: http://www.cmegroup.com/daily_bulletin/Section02A_Summary_Volume_And_Open_Interest_(Excludes%20TRAKRS)_Comm_Alt_Invest_Futures_And_Options_2009152.pdf

IOM DIVISION 1523788 456934 1980722 23017621 + 105674 3147531 27408756

after parsing:

IOM DIVISION 1523788 4569341980722 23026421 + 114474 3147531 27408756

cause:

String[229.68,212.15997 fs=1.0 xscale=6.0 height=3.7500005 width=14.412003]5693
String[244.08,212.15997 fs=1.0 xscale=6.0 height=3.7500005 width=85.805984]41
String[290.58,212.15997 fs=1.0 xscale=6.0 height=3.7500005 width=10.8089905]980

I tried using the classes from PDFBox-0.7.4.jar, but without success.
The code that was checked in for the PDFBOX-349 issue used "0.7.4", but this does not work for me... Sorry.

> spaces lost
> -----------
>
>                 Key: PDFBOX-234
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-234
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Priority: Minor
>             Fix For: 0.8.0-incubator
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1635950
> Originally submitted by tweakerbee on 2007-01-15 07:09.
> During extraction in certain PDF documents spaces will be lost. I have attached a file in which this problem occurs.
> Here PDFTextStripper.getText() returns:
> gaandeofincidenteleaardis
> whereas it should be
> gaande of incidentele aard is
> I have used the nightly build from today (15-01-07) but the problem still remains.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1635950&file_id=211376
> STB336.pdf (application/pdf), 51425 bytes
> document with erroneous text extraction
> [comment on SourceForge]
> Originally sent by tweakerbee.
> Logged In: YES
> user_id=1625706
> Originator: YES
> The problem turned out to be in the splitting algorithm. The values here turned out slightly too conservative.
> Using 0.33f (33%) turned out to yield proper results. This might split words that are not meant to be split, however.
> Maybe you could set this through a field in the TextStripper? So you can adjust your application slightly easier to your specific needs.
> This issue can be considered solved.
> startOfNextWordX = endOfLastTextX + (wordSpacing * 0.33f);
> startOfNextWordX = endOfLastTextX + (((wordSpacing + lastWordSpacing) / 2f) * 0.33f);
> [comment on SourceForge]
> Originally sent by tweakerbee.
> Logged In: YES
> user_id=1625706
> Originator: YES
> My previous assumption turned out to be incorrect.
> The context.showString() function is responsible for outputting the string.
> If anywhere, it should probably output the space here.
> [comment on SourceForge]
> Originally sent by tweakerbee.
> Logged In: YES
> user_id=1625706
> Originator: YES
> I am currently looking into the problem myself as well, but my complete lack of experience with the Portable Document Format, as well as being a novice Java programmer, is rather limiting.
> What I have found out so far is this:
> The problem is in the text stream, where a TJ operator is being used to show the glyphs. There are no spaces encoded in the file; instead it uses character spacing information to space out the words. An example is included below.
> The code I believe is responsible for extracting the text here (org.pdfbox.util.operator.ShowTextGlyph) does not contain any code to determine whether or not a space is needed. Would it be useful to add this here? And will this not break the org.pdfbox.util.PDFHighlighter? (I have noticed some difficulties with certain PDF documents and I wouldn't be surprised if the difference in character count originates from this issue.)
> Any help would be greatly appreciated.
> Example code in STB336.pdf:
> [(7?????)-278(???)-278(? ?"&????)-278(???)-278(???)-278( ??\))-278(???)-278(??????\)????\012)-278( ?????'??&)-278(??)]TJ

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
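
For illustration only: the TJ array quoted above separates its strings with -278 adjustments instead of space glyphs. In a TJ array a number is given in thousandths of a text space unit and is subtracted from the current horizontal position, so a negative value widens the gap. A minimal sketch of the idea discussed in the thread (emit a space when the gap exceeds an adjustable fraction of the space width) could look like the following. This is not PDFBox code: the class TjSpaceSketch, the method extract, and the assumed space width of 278 are hypothetical, and readable words are substituted for the encoded glyphs of STB336.pdf. Only the 0.33 ratio comes from the comments above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of space detection in a TJ array; not the PDFBox implementation. */
final class TjSpaceSketch {

    /**
     * Joins the elements of a TJ array into plain text.
     * Each element is either a String (a run of glyphs) or a Number
     * (a position adjustment in thousandths of a text space unit;
     * negative values move the next glyph further right).
     *
     * @param spaceWidth assumed width of a space, in the same thousandths unit
     * @param ratio      fraction of spaceWidth a gap must reach to count as a
     *                   word break (0.33f is the value suggested in the thread)
     */
    static String extract(List<Object> tjArray, float spaceWidth, float ratio) {
        StringBuilder out = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                out.append((String) element);
            } else if (element instanceof Number) {
                // The adjustment is subtracted from the position, so negate it
                // to get the width of the extra gap.
                float gap = -((Number) element).floatValue();
                boolean wideEnough = gap >= spaceWidth * ratio;
                boolean needsSpace = out.length() > 0
                        && out.charAt(out.length() - 1) != ' ';
                if (wideEnough && needsSpace) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Readable stand-in for the encoded strings in the report.
        List<Object> tj = Arrays.<Object>asList(
                "gaande", -278, "of", -278, "incidentele", -278, "aard", -278, "is");
        System.out.println(extract(tj, 278f, 0.33f));
    }
}
```

With these assumed inputs the program prints "gaande of incidentele aard is". Passing the ratio as a parameter mirrors the suggestion above to expose the threshold as a field, since a value that works for one document might split words in another.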