pdfbox-users mailing list archives

From Joel Hirsh <joelehi...@gmail.com>
Subject Re: Spacing problem with this pdf file
Date Tue, 29 Mar 2016 17:36:44 GMT
I understand, but is there anything I can do in my code to get the string
as shown in ExtractText?

I am subclassing PDFTextStripper, similar to what is done
in PrintTextLocations, and the string coming into writeString(String
string, List<TextPosition> textPositions) is the one where all the spaces
occur.

Thanks
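
One possible workaround, offered as a sketch under assumptions rather than anything confirmed in this thread: since the quoted reply shows a blank and a visible character reported at practically the same x position, the stray blanks could be dropped before the string is used. The `Glyph` class below is a hypothetical stand-in for the three values one would read from each `TextPosition` (assumed here to be `getUnicode()`, `getXDirAdj()` and `getWidthDirAdj()`), and `OverlapFilter`, `dropOverlappingBlanks` and the `EPSILON` tolerance are illustrative names and values, not part of PDFBox.

```java
import java.util.List;

// Hypothetical stand-in for the TextPosition values that matter here.
final class Glyph {
    final String text;   // assumed source: TextPosition.getUnicode()
    final float x;       // assumed source: TextPosition.getXDirAdj()
    final float width;   // assumed source: TextPosition.getWidthDirAdj()

    Glyph(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }

    // True when the glyph carries only whitespace.
    boolean isBlank() {
        return text.trim().isEmpty();
    }
}

final class OverlapFilter {
    // Tolerance in text-space units; an assumed value that may need tuning per file.
    static final float EPSILON = 0.01f;

    // Rebuilds the text of one line, skipping every blank glyph whose start
    // position lies inside the horizontal extent of a visible glyph.
    static String dropOverlappingBlanks(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : line) {
            if (g.isBlank() && startsInsideVisibleGlyph(g, line)) {
                continue; // blank drawn on top of a character: drop it
            }
            out.append(g.text);
        }
        return out.toString();
    }

    private static boolean startsInsideVisibleGlyph(Glyph blank, List<Glyph> line) {
        for (Glyph other : line) {
            if (other.isBlank()) {
                continue; // blanks overlapping other blanks are left alone
            }
            boolean afterStart = blank.x >= other.x - EPSILON;
            boolean beforeEnd = blank.x < other.x + other.width - EPSILON;
            if (afterStart && beforeEnd) {
                return true;
            }
        }
        return false;
    }
}
```

In a PDFTextStripper subclass, the idea would be to build one `Glyph` per entry of `textPositions` inside the `writeString` override and emit the filtered string instead of the incoming one. Blanks that sit in a real gap between characters are kept, because they do not start inside any visible glyph.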

On Tue, Mar 29, 2016 at 10:03 AM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> Here's what I got with ExtractText command line application:
>
> ______
> ______                                          03-09 3,411.69
> ELECTRONIC DEPOSIT     FDMS-SETTLEMENT  DEPOSIT 376249462999
>   03-10          1,645.22     ELECTRONIC DEPOSIT FDMS-SETTLEMENT  DEPOSIT
> 376249462999
>
>
>
> However I think I understand the cause of your problem, because there's
> output like this:
>
> String[461.20358,340.904 fs=1.0 xscale=1.0 height=4.44 space=4.7999997 width=4.799988]6
> String[461.20428,340.904 fs=1.0 xscale=1.0 height=6.48 space=7.2 width=7.200012]
>
> i.e. a space and a character at the same place. See this content stream:
>
> BT
>   0 0 0 rg
>   /F0 1 Tf
>   1 0 0 1 29.204 460.096 Tm
> ( ______                                         ) Tj
>   1 0 0 1 29.204 451.096 Tm
> ( ______                                         ) Tj
>   /F1 1 Tf
>   1 0 0 1 29.204 451.096 Tm
>   (  03-09          3,411.69     ELECTRONIC DEPOSIT FDMS-SETTLEMENT
> DEPOSIT 376249462999                                    ) Tj
>   1 0 0 1 29.204 442.096 Tm
>   (  03-10          1,645.22     ELECTRONIC DEPOSIT FDMS-SETTLEMENT
> DEPOSIT 376249462999                                    ) Tj
> ET
>
> There are two lines that start at the same position 29.204 451.096, one
> containing only blanks and one containing text. That is a bug in the
> software that created the file.
>
> Tilman
>
>
> Am 29.03.2016 um 18:48 schrieb Joel Hirsh:
>
>> I thought it was attached to the first email, but it is also available at
>>
>> https://www.dropbox.com/s/btqwaxfsubt3rwx/extra%20spaces.pdf?dl=0
>>
>>
>> On Tue, Mar 29, 2016 at 9:13 AM, Tilman Hausherr <THausherr@t-online.de>
>> wrote:
>>
>>> Please upload that file somewhere.
>>>
>>> Tilman
>>>
>>>
>>> Am 29.03.2016 um 17:24 schrieb Joel Hirsh:
>>>
>>>> I have a couple of PDF files that have this problem.  These are
>>>> multi-page PDF files, and on one page (the first) there are a few lines
>>>> that get extra spaces between almost every character as seen from
>>>> PrintTextLocations.
>>>>
>>>> Attached is a snippet from one of those files, the first line has the
>>>> problem, the second line does not.
>>>>
>>>> In this file, the first line gets a string that is
>>>> 0 3- 09               3 ,4 1 1. 6 9        EL E CT R ON I C  D EP O SI T
>>>>         F DM S -S E TT L EM E NT    D E PO S IT       37 6 24 9 46 2 99
>>>> 9
>>>>
>>>> While the second line gets the text without any extra spaces.
>>>>
>>>> The two lines also have different spacing values as reported by
>>>> PrintTextLocations.  In the full file, all the good lines have one
>>>> value, the bad lines a different value.
>>>>
>>>> I cannot see any difference between the lines in Acrobat, doing
>>>> copy/paste, Nitro editing.
>>>>
>>>> This problem shows up in 2.0.0 and the latest 2.0.1 snapshot, and some
>>>> older versions I tried as well (i.e. I don't think it is any kind of
>>>> regression)
>>>>
>>>> Thanks
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>
>>>>
>>>
>
