pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Spacing problem with this pdf file
Date Tue, 29 Mar 2016 17:57:28 GMT
Hi,

If all your files are like that, just drop the spaces and base your
extraction on positions only. A PDF is not guaranteed to contain space
characters between two words anyway.

Tilman
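
[Editor's sketch, not part of the original message.] The advice above (ignore
the space glyphs and rebuild each line from glyph positions) could look roughly
like this. It is a minimal sketch under stated assumptions: Glyph is a
hypothetical stand-in for the values a PDFTextStripper subclass would read from
each TextPosition in writeString (getUnicode(), getXDirAdj(), getWidthDirAdj(),
getWidthOfSpace()), and the 0.5 gap threshold is an arbitrary choice, not a
PDFBox default.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PositionBasedLine {

    /** Hypothetical holder for the per-glyph values taken from a TextPosition. */
    static final class Glyph {
        final String text;      // decoded character(s)
        final float x;          // left edge of the glyph
        final float width;      // advance width of the glyph
        final float spaceWidth; // width of a space in the glyph's font

        Glyph(String text, float x, float width, float spaceWidth) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    /**
     * Rebuilds one line of text from positions only: whitespace glyphs are
     * discarded, the rest are sorted left to right, and a single space is
     * emitted wherever the horizontal gap exceeds half a space width.
     */
    static String rebuild(List<Glyph> glyphs) {
        List<Glyph> visible = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (!g.text.trim().isEmpty()) {
                visible.add(g); // keep only glyphs that draw something
            }
        }
        visible.sort(Comparator.comparingDouble(g -> g.x));

        StringBuilder line = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : visible) {
            if (previous != null) {
                float gap = g.x - (previous.x + previous.width);
                if (gap > 0.5f * previous.spaceWidth) {
                    line.append(' '); // assumed threshold, tune per document
                }
            }
            line.append(g.text);
            previous = g;
        }
        return line.toString();
    }
}
```

This collapses any wide gap into a single space, so column alignment is lost.
If the columns matter, the gap could be divided by the space width to decide
how many spaces to emit.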

On 29.03.2016 at 19:36, Joel Hirsh wrote:
> I understand, but is there anything I can do in my code to get the string
> as shown in ExtractText?
>
> I am subclassing PDFTextStripper, similar to what is done
> in PrintTextLocations, and the string coming into writeString(String
> string, List<TextPosition> textPositions) is the one where all the spaces
> occur.
>
> Thanks
>
> On Tue, Mar 29, 2016 at 10:03 AM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> Here's what I got with ExtractText command line application:
>>
>> ______
>> ______                                          03-09 3,411.69
>> ELECTRONIC DEPOSIT     FDMS-SETTLEMENT  DEPOSIT 376249462999
>>    03-10          1,645.22     ELECTRONIC DEPOSIT FDMS-SETTLEMENT  DEPOSIT
>> 376249462999
>>
>>
>>
>> However I think I understand the cause of your problem, because there's
>> output like this:
>>
>> String[461.20358,340.904 fs=1.0 xscale=1.0 height=4.44 space=4.7999997 width=4.799988]6
>> String[461.20428,340.904 fs=1.0 xscale=1.0 height=6.48 space=7.2 width=7.200012]
>>
>> i.e. a space and a character at the same place. See this content stream:
>>
>> BT
>>    0 0 0 rg
>>    /F0 1 Tf
>>    1 0 0 1 29.204 460.096 Tm
>> ( ______                                         ) Tj
>>    1 0 0 1 29.204 451.096 Tm
>> ( ______                                         ) Tj
>>    /F1 1 Tf
>>    1 0 0 1 29.204 451.096 Tm
>>    (  03-09          3,411.69     ELECTRONIC DEPOSIT FDMS-SETTLEMENT
>> DEPOSIT 376249462999                                    ) Tj
>>    1 0 0 1 29.204 442.096 Tm
>>    (  03-10          1,645.22     ELECTRONIC DEPOSIT FDMS-SETTLEMENT
>> DEPOSIT 376249462999                                    ) Tj
>> ET
>>
>> There are two lines that start at the same position (29.204, 451.096), one
>> with blanks, one with text. That is a bug in the software that created the
>> file.
>>
>> Tilman
>>
>>
>> On 29.03.2016 at 18:48, Joel Hirsh wrote:
>>
>>> I thought it was attached to the first email, but it is also available at
>>>
>>> https://www.dropbox.com/s/btqwaxfsubt3rwx/extra%20spaces.pdf?dl=0
>>>
>>>
>>> On Tue, Mar 29, 2016 at 9:13 AM, Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>
>>>> Please upload that file somewhere.
>>>> Tilman
>>>>
>>>>
>>>> On 29.03.2016 at 17:24, Joel Hirsh wrote:
>>>>
>>>>> I have a couple of PDF files that have this problem.  These are
>>>>> multi-page PDF files, and on one page (the first) there are a few lines
>>>>> that get extra spaces between almost every character as seen from
>>>>> PrintTextLocations.
>>>>>
>>>>> Attached is a snippet from one of those files, the first line has the
>>>>> problem, the second line does not.
>>>>>
>>>>> In this file, the first line gets a string that is
>>>>> 0 3- 09               3 ,4 1 1. 6 9        EL E CT R ON I C  D EP O SI T
>>>>>          F DM S -S E TT L EM E NT    D E PO S IT       37 6 24 9 46 2 99 9
>>>>>
>>>>> While the second line gets the text without any extra spaces.
>>>>>
>>>>> The two lines also have different spacing values as reported by
>>>>> PrintTextLocations.  In the full file, all the good lines have one
>>>>> value, the bad lines a different value.
>>>>>
>>>>> I cannot see any difference between the lines in Acrobat, when
>>>>> copying and pasting, or when editing in Nitro.
>>>>>
>>>>> This problem shows up in 2.0.0 and the latest 2.0.1 snapshot, and some
>>>>> older versions I tried as well (i.e. I don't think it is any kind of
>>>>> regression)
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>
>>>>>

