pdfbox-dev mailing list archives

From "Brian Carrier (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PDFBOX-61) Spaces in extracted file
Date Tue, 24 Feb 2009 16:34:03 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676327#action_12676327 ]

Brian Carrier commented on PDFBOX-61:

Note that Adobe Reader also gets this file wrong. PDFBox has to guess where some spaces
should go, and the guessing works better with some fonts than with others. The trunk
currently has a calculation in PDFTextStripper.writePage() that uses a fraction of 0.50 to
estimate the location of the next character. When I change that value to 0.65, the Tom_3 file
comes out fine (0.60 still causes an extra space). However, several of the regression tests
start to fail quite badly when values of 0.60 and above are used...

There seem to be two options:
1) We make the fraction configurable via an API so that callers can change it for files
that they know have atypical font shapes and sizes (keeping the current 0.5 value as the
default).
2) We try to find a better way to estimate where the next character should be.
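
To make option 1 concrete, here is a minimal sketch of what a configurable fraction could
look like. It is not the code in PDFTextStripper.writePage(); the Glyph and SpaceGuesser
classes, their method names, and the rule "insert a space when the gap to the next glyph
exceeds fraction * spaceWidth" are all assumptions made up for illustration. The positions
and widths are invented numbers, not measurements from Tom_3.pdf.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical glyph: its text, horizontal start position and width (same units). */
final class Glyph {
    final String text;
    final float x;
    final float width;

    Glyph(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }
}

/** Hypothetical helper, not part of PDFBox: guesses spaces from glyph gaps. */
final class SpaceGuesser {
    /** Default mirrors the 0.5 value mentioned in the comment above. */
    static final float DEFAULT_FRACTION = 0.5f;

    private float fraction = DEFAULT_FRACTION;

    /** Lets callers raise or lower the threshold for unusual fonts. */
    void setFraction(float fraction) {
        if (!(fraction > 0f)) {
            throw new IllegalArgumentException("fraction must be positive");
        }
        this.fraction = fraction;
    }

    float getFraction() {
        return fraction;
    }

    /**
     * Joins the glyphs of one line, left to right. A space is inserted when the
     * gap between the end of the previous glyph and the start of the next one
     * is larger than fraction * spaceWidth (assumed rule, see lead-in).
     */
    String join(List<Glyph> line, float spaceWidth) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : line) {
            if (previous != null) {
                float gap = glyph.x - (previous.x + previous.width);
                if (gap > fraction * spaceWidth) {
                    out.append(' ');
                }
            }
            out.append(glyph.text);
            previous = glyph;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Invented layout: a gap of 3 units after "a", a gap of 5 units after "b".
        List<Glyph> line = Arrays.asList(
                new Glyph("a", 0f, 6f),
                new Glyph("b", 9f, 6f),
                new Glyph("c", 20f, 6f));
        SpaceGuesser guesser = new SpaceGuesser();
        System.out.println(guesser.join(line, 5f)); // prints "a b c"
        guesser.setFraction(0.65f);
        System.out.println(guesser.join(line, 5f)); // prints "ab c"
    }
}
```

With a space width of 5 units, the default fraction puts the threshold at 2.5 units, so the
3-unit gap is read as a space; at 0.65 the threshold is 3.25 units and the same gap is treated
as ordinary letter spacing. Keeping 0.5 as the default means existing callers, and the
regression tests, would see no change unless they opt in.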

> Spaces in extracted file
> ------------------------
>                 Key: PDFBOX-61
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-61
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1208824
> Originally submitted by nobody on 2005-05-25 16:40.
> In trying to integrate with lucene, I was having 
> problems.  The Lucene people suggested that I check 
> the output of extract utility against one of my test pdf's.  
> When I did, I saw spaces placed inside many of the 
> words.  I was on version 0.7.0.  So I downloaded 0.7.1 
> and see the same results.
> One of the test files where I see this issue is attached.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1208824&file_id=135995
> Tom_3.pdf (application/pdf), 10145 bytes
> Test pdf file.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
