pdfbox-users mailing list archives

From Andreas Lehmkühler <andr...@lehmi.de>
Subject Re: Major differences between PDFTextStripper and PrintTextLocations
Date Mon, 10 Aug 2015 10:18:28 GMT
Hi Gilad,

Sorry for the late answer.

I'm not sure what you're expecting. You are using two completely different
approaches to process a PDF. PrintTextLocations prints a lot of additional
information for every piece of text, and a piece may range from a single
character up to whole words or lines of text. Consequently, its output is
necessarily very different from, and of course much larger than, the output of
a simple text extraction.
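
To make the size difference concrete, here is a minimal sketch in plain Java. It deliberately does not call PDFBox: `Glyph`, `layout`, `plainText` and `locationReport` are hypothetical names invented for this illustration, and the fixed character advance and the report line format are assumptions. It only models the two kinds of output described above: one that keeps just the characters, and one that emits a full line of metadata for every piece of text.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractionOutputSketch {

    // Hypothetical stand-in for one positioned piece of text (not a PDFBox class).
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float fontSize;

        Glyph(String text, float x, float y, float fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    // Lays out a string as one glyph per character on a single baseline.
    // The advance of half the font size is an assumption for illustration only.
    static List<Glyph> layout(String s, float startX, float y, float fontSize) {
        List<Glyph> glyphs = new ArrayList<>();
        float x = startX;
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(s.charAt(i)), x, y, fontSize));
            x += fontSize * 0.5f;
        }
        return glyphs;
    }

    // Simple text extraction: keeps only the characters themselves.
    static String plainText(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    // Location report: one line per glyph, carrying position and font size.
    static String locationReport(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append("String[").append(g.x).append(',').append(g.y)
              .append(" fs=").append(g.fontSize).append(']')
              .append(g.text).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = layout("Hello PDF", 72f, 700f, 12f);
        System.out.println(plainText(glyphs));
        System.out.print(locationReport(glyphs));
    }
}
```

Running `main` prints the text once as a single line, followed by one metadata line per character, so the second output is many times longer even though it describes the same text.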

BR
Andreas

> On 10 August 2015 at 10:05, Gilad Denneboom <gilad.denneboom@gmail.com>
> wrote:
> 
> 
> No one has any ideas?
> 
> On Thu, Aug 6, 2015 at 5:49 PM, Gilad Denneboom <gilad.denneboom@gmail.com>
> wrote:
> 
> > Hi everyone,
> >
> > I'm looking for advice on a problem I'm encountering where the output of
> > PDFTextStripper and PrintTextLocations is dramatically different when
> > processing the same file.
> > For some reason, the output of PrintTextLocations is 12 times longer than
> > that of PDFTextStripper, i.e. the entire text is printed out 12 times
> > instead of just once.
> >
> > I'm attaching the file in question, as well as the output produced using
> > both methods via Google Drive... Hopefully it will come through.
> >
> > I'd appreciate any ideas as to what might be causing this issue (I'm
> > guessing there's something wrong with the structure of the file), and of
> > course any possible solutions.
> >
> > Thanks in advance, Gilad.
> >
> > PS. I'm using PDFBox 1.8.10.
> >
> > output problem.zip
> > <https://drive.google.com/file/d/0B_eBFHMNjkhseTVaQ0FxSkdmZUE/view?usp=drive_web>
> >
