pdfbox-users mailing list archives

From Shyam Sundar <sw.crafts...@gmail.com>
Subject Re: Spacing between lines not retained
Date Fri, 29 Jul 2016 11:19:34 GMT
Thanks, Kovi, for the quick response.

But why does it fail only for one particular file? A replica of the same
file generated using another PDF library works perfectly fine with
PDFTextStripper. Isn't that strange, and doesn't it look like a bug?

I hope you checked the shared Sample.zip; it contains both the working and
the non-working files.

Regards.

On Fri, Jul 29, 2016 at 4:30 PM, Gregor Kovač <kovica@gmail.com> wrote:

> Hi!
>
> The API docs for PDFTextStripper
> (http://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html)
> state that "This class will take a pdf document and strip out all of the
> text and ignore the formatting and such". Please note that you can call
> setAddMoreFormatting
> (http://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setAddMoreFormatting(boolean))
> with true and it will add a bit more formatting, but in my experience this
> does not compare to using "pdftotext -layout" from the Xpdf project.
> pdftotext does a much better job of preserving layout.
>
> Best regards,
>     Kovi
>
> 2016-07-29 12:44 GMT+02:00 Shyam Sundar <sw.craftsman@gmail.com>:
>
> > Hi,
> >
> > While converting a particular PDF to text, the spacing between lines and
> > paragraphs is not retained; the output is just flat text.
> >
> > Sample file : ftp://PfXxyEhxh:h7hHhpOh7O@ftp.emc.com/Sample.zip
> >
> > It looks like a file-specific issue. Can you please check?
> >
> > Thanks.
> >
>
>
>
> --
> -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
> |  In A World Without Fences Who Needs Gates?  |
> |              Experience Linux.               |
> -~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
>
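
Note on the question above: a text extractor only sees positioned runs of
text, so any blank lines in the output have to be guessed from the vertical
gaps between lines. The sketch below illustrates that general idea in plain
Java. It is not PDFBox's actual algorithm; the names LineGapSketch, TextLine,
layout and paragraphFactor are hypothetical, and the input is assumed to be
already sorted from top to bottom.

```java
import java.util.ArrayList;
import java.util.List;

public class LineGapSketch {

    /** Hypothetical stand-in for one extracted line and its vertical position. */
    public static final class TextLine {
        final String text;
        final double y;       // distance from the top of the page, in points
        final double height;  // font height, in points

        public TextLine(String text, double y, double height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Joins lines in order, adding one extra blank line wherever the vertical
     * gap to the previous line exceeds paragraphFactor times that line's height.
     * Assumes the lines are already sorted from top to bottom.
     */
    public static String layout(List<TextLine> lines, double paragraphFactor) {
        StringBuilder out = new StringBuilder();
        TextLine prev = null;
        for (TextLine line : lines) {
            if (prev != null) {
                out.append('\n');
                double gap = line.y - prev.y;
                if (gap > paragraphFactor * prev.height) {
                    out.append('\n'); // large gap: treat it as a paragraph break
                }
            }
            out.append(line.text);
            prev = line;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<TextLine> lines = new ArrayList<>();
        lines.add(new TextLine("First paragraph, line one.", 100, 12));
        lines.add(new TextLine("First paragraph, line two.", 114, 12));
        lines.add(new TextLine("Second paragraph.", 150, 12));
        System.out.println(layout(lines, 1.5));
    }
}
```

With a heuristic of this kind, two files that look identical on screen can
still extract differently if the generating libraries position their text
differently, which is one possible (unconfirmed) explanation for the
behaviour described in this thread.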
