pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Major differences between PDFTextStripper and PrintTextLocations
Date Mon, 10 Aug 2015 16:35:39 GMT
Hi,

On 10.08.2015 at 13:22, Gilad Denneboom wrote:
> Hi Andreas,
>
> Of course the output itself is different, but I would expect that the
> underlying text each tool processes would be the same, and it's not. Have a
> look at the first line in the PrintTextLocations output file:
> String[472.89,54.0 fs=10.0 xscale=10.0 height=7.21 space=2.5 width=2.7799988]:
> It is repeated, with exactly the same information, 12 times throughout the
> output, lines 1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901 and 991.
>
> Why would the same information be processed 12 times in a single run?
The PDF contains a lot of redundant information, e.g. the header is repeated
several times (I didn't count, but I guess it's 12 times). PDFTextStripper
eliminates overlapping text/characters and PrintTextLocations doesn't, so
PrintTextLocations reports every copy.
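
To illustrate the idea, here is a minimal, self-contained sketch of duplicate
suppression. It is not the PDFBox implementation: the Glyph class, the dedupe
method and the tolerance value are made up for this example. In PDFBox itself
the behaviour is controlled by
PDFTextStripper.setSuppressDuplicateOverlappingText(boolean).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified stand-in for positioned text,
// not PDFBox's TextPosition and not PDFBox's actual algorithm.
class OverlapDedup {

    /** One piece of text together with the position it is drawn at (hypothetical type). */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps a glyph only if no identical text was already kept within `tolerance` of it. */
    static List<Glyph> dedupe(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> page = new ArrayList<>();
        // The same piece of text drawn 12 times at one position (placeholder text "x").
        for (int i = 0; i < 12; i++) {
            page.add(new Glyph("x", 472.89f, 54.0f));
        }
        // Same text at a different position: not a duplicate.
        page.add(new Glyph("x", 100.0f, 200.0f));

        System.out.println("raw: " + page.size());                        // raw: 13
        System.out.println("deduplicated: " + dedupe(page, 0.5f).size()); // deduplicated: 2
    }
}
```

A tool that prints every glyph it receives corresponds to the "raw" count, a
tool that filters first corresponds to the "deduplicated" one.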

BR
Andreas

> Gilad
>
> On Mon, Aug 10, 2015 at 12:18 PM, Andreas Lehmkühler <andreas@lehmi.de> wrote:
>
>> Hi Gilad,
>>
>> sorry for the late answer ....
>>
>> I'm not sure what you're expecting. You are using 2 totally different
>> approaches to process a pdf. PrintTextLocations provides a lot of
>> additional information for every piece of text, which may vary from one
>> character up to whole words or lines of text. Consequently the output has
>> to be totally different and of course much bigger than the output of a
>> simple text extraction.
>>
>> BR
>> Andreas
>>
>>> Gilad Denneboom <gilad.denneboom@gmail.com> wrote on 10 August 2015 at 10:05:
>>>
>>>
>>> No one has any ideas?
>>>
>>> On Thu, Aug 6, 2015 at 5:49 PM, Gilad Denneboom <gilad.denneboom@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm looking for advice on a problem I'm encountering where the output of
>>>> PDFTextStripper and PrintTextLocations is dramatically different when
>>>> processing the same file.
>>>> For some reason, the output of PrintTextLocations is 12 times longer than
>>>> that of PDFTextStripper, i.e. the entire text is printed out 12 times,
>>>> instead of just once.
>>>>
>>>> I'm attaching the file in question, as well as the output produced using
>>>> both methods via Google Drive... Hopefully it will come through.
>>>>
>>>> I'd appreciate any ideas as to what might be causing this issue (I'm
>>>> guessing there's something wrong with the structure of the file), and of
>>>> course any possible solutions.
>>>>
>>>> Thanks in advance, Gilad.
>>>>
>>>> PS. I'm using 1.8.10.
>>>> output problem.zip
>>>> <https://drive.google.com/file/d/0B_eBFHMNjkhseTVaQ0FxSkdmZUE/view?usp=drive_web>
>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>
>

