pdfbox-users mailing list archives

From Gregor Kovač <kov...@gmail.com>
Subject Re: Spacing between lines not retained
Date Fri, 29 Jul 2016 11:00:31 GMT
Hi!

The API docs for PDFTextStripper (
http://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html)
state that "This class will take a pdf document and strip out all of the
text and ignore the formatting and such". Note that you can call
setAddMoreFormatting (
http://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setAddMoreFormatting(boolean))
with true to get a bit more formatting, but in my experience the result
does not compare to "pdftotext -layout" from the Xpdf project, which
does a much better job of preserving layout.
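
If you want to try pdftotext from Java code, here is a minimal sketch that
shells out to it. It assumes a pdftotext binary (from Xpdf or a compatible
build) is installed and on the PATH; the class and method names
(PdfToTextLayout, buildCommand, convert) are made up for illustration and
are not part of PDFBox or Xpdf.

```java
import java.io.IOException;
import java.util.List;

public class PdfToTextLayout {

    // Builds the argument list for: pdftotext -layout <input.pdf> <output.txt>
    // The -layout option asks pdftotext to keep the physical page layout.
    static List<String> buildCommand(String inputPdf, String outputTxt) {
        return List.of("pdftotext", "-layout", inputPdf, outputTxt);
    }

    // Runs pdftotext and returns its exit code.
    // Assumption: the pdftotext executable can be found on the PATH.
    static int convert(String inputPdf, String outputTxt)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(inputPdf, outputTxt))
                .inheritIO() // forward pdftotext's own messages to the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java PdfToTextLayout <input.pdf> <output.txt>");
            System.exit(2);
        }
        System.exit(convert(args[0], args[1]));
    }
}
```

Running "java PdfToTextLayout Sample.pdf Sample.txt" should then write the
text to Sample.txt, provided pdftotext itself can read the file.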

Best regards,
    Kovi

2016-07-29 12:44 GMT+02:00 Shyam Sundar <sw.craftsman@gmail.com>:

> Hi,
>
> While converting a particular PDF to text, the spacing between lines and
> paragraphs is not retained; the output is just flat text.
>
> Sample file : ftp://PfXxyEhxh:h7hHhpOh7O@ftp.emc.com/Sample.zip
>
> This looks like a file-specific issue. Could you please check?
>
> Thanks.
>



-- 
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
|  In A World Without Fences Who Needs Gates?  |
|              Experience Linux.               |
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
