pdfbox-users mailing list archives

From Gilad Denneboom <gilad.denneb...@gmail.com>
Subject Re: Major differences between PDFTextStripper and PrintTextLocations
Date Mon, 10 Aug 2015 16:48:03 GMT
I guessed it was something like that... Do you think it's because it was
generated with iText?

On Mon, Aug 10, 2015 at 6:35 PM, Andreas Lehmkuehler <andreas@lehmi.de>
wrote:

> Hi,
>
> On 10.08.2015 at 13:22, Gilad Denneboom wrote:
>
>> Hi Andreas,
>>
>> Of course the output itself is different, but I would expect that the
>> underlying text each tool processes would be the same, and it's not. Have a
>> look at the first line in the PrintTextLocations output file:
>> String[472.89,54.0 fs=10.0 xscale=10.0 height=7.21 space=2.5 width=2.7799988]:
>> It is repeated, with exactly the same information, 12 times throughout the
>> output, lines 1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901 and 991.
>>
>> Why would the same information be processed 12 times in a single run?
>>
> The pdf contains a lot of redundant information, e.g. the header is
> repeated several times (I didn't count them but I guess it's 12 times).
> PDFTextStripper eliminates overlapping text/characters and
> PrintTextLocations doesn't.
>
> BR
> Andreas
>
>
>> Gilad
>>
>> On Mon, Aug 10, 2015 at 12:18 PM, Andreas Lehmkühler <andreas@lehmi.de>
>> wrote:
>>
>>> Hi Gilad,
>>>
>>> sorry for the late answer ....
>>>
>>> I'm not sure what you're expecting. You are using 2 totally different
>>> approaches to process a pdf. PrintTextLocations provides a lot of
>>> additional information for every piece of text, which may vary from one
>>> character up to whole words or lines of text. Consequently the output has
>>> to be totally different and of course much bigger than the output of a
>>> simple text extraction.
>>>
>>> BR
>>> Andreas
>>>
>>> On 10 August 2015 at 10:05, Gilad Denneboom <gilad.denneboom@gmail.com>
>>> wrote:
>>>
>>>> No one has any ideas?
>>>>
>>>> On Thu, Aug 6, 2015 at 5:49 PM, Gilad Denneboom <gilad.denneboom@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm looking for advice on a problem I'm encountering where the output of
>>>>> PDFTextStripper and PrintTextLocations is dramatically different when
>>>>> processing the same file.
>>>>> For some reason, the output of PrintTextLocations is 12 times longer than
>>>>> that of PDFTextStripper, i.e. the entire text is printed out 12 times,
>>>>> instead of just once.
>>>>>
>>>>> I'm attaching the file in question, as well as the output produced using
>>>>> both methods via Google Drive... Hopefully it will come through.
>>>>>
>>>>> I'd appreciate any ideas as to what might be causing this issue (I'm
>>>>> guessing there's something wrong with the structure of the file), and of
>>>>> course any possible solutions.
>>>>>
>>>>> Thanks in advance, Gilad.
>>>>>
>>>>> PS. I'm using 1.8.10.
>>>>>
>>>>> output problem.zip
>>>>> <https://drive.google.com/file/d/0B_eBFHMNjkhseTVaQ0FxSkdmZUE/view?usp=drive_web>
>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>
>>>
>>>
>>
>

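Andreas's explanation in the quoted thread (PDFTextStripper eliminates overlapping text/characters, PrintTextLocations doesn't) can be illustrated with a small self-contained sketch. This is not PDFBox source code: the `Glyph` class, the `suppressOverlaps` method, and the fixed tolerance of 1.0 are assumptions made up for illustration. It only shows the general idea of dropping a character when an identical character has already been seen at (nearly) the same position. In PDFBox itself, the related switch is, as far as I know, `PDFTextStripper.setSuppressDuplicateOverlappingText(boolean)`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OverlapSuppressionSketch {

    /** Hypothetical stand-in for a positioned piece of text (not PDFBox's TextPosition). */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Keeps a glyph only if no glyph with the same text has already been
     * kept within `tolerance` units of it in both x and y.
     */
    static List<Glyph> suppressOverlaps(List<Glyph> glyphs, float tolerance) {
        Map<String, List<Glyph>> keptByText = new HashMap<>();
        List<Glyph> result = new ArrayList<>();
        for (Glyph g : glyphs) {
            List<Glyph> sameText = keptByText.computeIfAbsent(g.text, k -> new ArrayList<>());
            boolean duplicate = false;
            for (Glyph other : sameText) {
                if (Math.abs(other.x - g.x) < tolerance && Math.abs(other.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                sameText.add(g);
                result.add(g);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> raw = new ArrayList<>();
        // The same header glyph drawn 12 times at the same position, as in the thread.
        for (int i = 0; i < 12; i++) {
            raw.add(new Glyph("H", 472.89f, 54.0f));
        }
        // An identical character elsewhere on the page; it must survive.
        raw.add(new Glyph("H", 100.0f, 700.0f));

        List<Glyph> deduped = suppressOverlaps(raw, 1.0f);
        System.out.println("raw=" + raw.size() + " deduped=" + deduped.size());
    }
}
```

Running `main` prints `raw=13 deduped=2`: a tool that reports every glyph sees all 13 entries, while one that suppresses overlaps keeps one copy of the repeated header character plus the character at the other position. That matches the 12-fold difference described in the thread, under the assumption that the repeated content really is drawn at identical coordinates.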