pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Spacing between lines not retained
Date Fri, 29 Jul 2016 17:13:35 GMT
On 29.07.2016 at 13:19, Shyam Sundar wrote:
> Thanks, Kovi, for the quick response.
>
> But why does it fail only for one particular file? A replica of the same file
> generated with another PDF library works perfectly fine with
> PDFTextStripper. Isn't that strange, and doesn't it look like a bug?
>
> I hope you checked the shared Sample.zip; it has both the working and the
> non-working files.

The "working" file contains lines that consist of a single space, which is
why its line spacing survives extraction.

That is what I'd expected. If you want perfectly formatted text, why
not use the PDF itself? Text extraction is usually meant for searching.

You can also use the PrintTextLocations.java example, which shows the
coordinates of every character. The DrawPrintTextLocations example
shows that as well, plus the visual location of the glyphs in an image
rendering.
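
As a rough illustration of what such coordinates make possible (a sketch
only, not something PDFBox provides): once each extracted line has a
y-coordinate, a blank line can be put back wherever the vertical gap is
clearly larger than the normal line pitch. The PositionedLine type, the
restoreBlankLines helper, and the 1.5 factor are hypothetical names and
values chosen for this example. The y values are assumed to grow down the
page, as values taken from TextPosition.getYDirAdj() are understood to do.

```java
import java.util.ArrayList;
import java.util.List;

class LineGapSketch {

    /** Hypothetical holder: one extracted line of text plus its y-coordinate. */
    static final class PositionedLine {
        final String text;
        final float y; // assumed to grow down the page

        PositionedLine(String text, float y) {
            this.text = text;
            this.y = y;
        }
    }

    /**
     * Joins the lines with '\n' and emits one extra '\n' (a blank line)
     * whenever the distance to the previous line exceeds
     * linePitch * gapFactor.
     */
    static String restoreBlankLines(List<PositionedLine> lines,
                                    float linePitch, float gapFactor) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            if (i > 0) {
                float gap = lines.get(i).y - lines.get(i - 1).y;
                if (gap > linePitch * gapFactor) {
                    out.append('\n'); // gap is larger than usual: paragraph break
                }
            }
            out.append(lines.get(i).text).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: two lines 12 units apart, then a 24-unit gap.
        List<PositionedLine> lines = new ArrayList<>();
        lines.add(new PositionedLine("First paragraph, line 1", 100f));
        lines.add(new PositionedLine("First paragraph, line 2", 112f));
        lines.add(new PositionedLine("Second paragraph, line 1", 136f));
        System.out.print(restoreBlankLines(lines, 12f, 1.5f));
    }
}
```

With the made-up coordinates in main, the first two lines stay adjacent and
a blank line appears before the third. In real use the line pitch and the
factor would have to be tuned to the document at hand.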

You could also try calling setParagraphStart("\n") and/or
setParagraphEnd("\n") on the PDFTextStripper.

Tilman

>
> Regards.
>
> On Fri, Jul 29, 2016 at 4:30 PM, Gregor Kovač <kovica@gmail.com> wrote:
>
>> Hi!
>>
>> The API docs for PDFTextStripper
>> (http://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html)
>> state that "This class will take a pdf document and strip out all of the
>> text and ignore the formatting and such". Please note that you can call
>> setAddMoreFormatting
>> (http://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setAddMoreFormatting(boolean))
>> with true and it will add a bit more formatting, but in my experience this
>> does not compare to using "pdftotext -layout" from the Xpdf project.
>> pdftotext does a much better job of preserving layout.
>>
>> Best regards,
>>      Kovi
>>
>> 2016-07-29 12:44 GMT+02:00 Shyam Sundar <sw.craftsman@gmail.com>:
>>
>>> Hi,
>>>
>>> When converting a particular PDF to text, the spacing between lines and
>>> paragraphs is not retained; the output is just flat text.
>>>
>>> Sample file : ftp://PfXxyEhxh:h7hHhpOh7O@ftp.emc.com/Sample.zip
>>>
>>> It looks like a file-specific issue. Could you please check?
>>>
>>> Thanks.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>
>>
>>
>> --
>> -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
>> |  In A World Without Fences Who Needs Gates?  |
>> |              Experience Linux.               |
>> -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
>>



