pdfbox-users mailing list archives

From "Harrington, Ferdinand B" <Ferdinand.Harring...@ManTech.com>
Subject RE: extract bullet points from a PDF
Date Thu, 29 Sep 2016 19:11:04 GMT
I found PDFText2HTML.java. Is there an example of how to call it?
Outlook distorted my message. The data is indented like this, as bullets:

Abc
Def
     Xyz
     Ghi
          123
          456

Thank you.

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Thursday, September 29, 2016 2:44 PM
To: users@pdfbox.apache.org
Subject: Re: extract bullet points from a PDF

On 29.09.2016 at 15:08, win harrington wrote:
> I would like to extract all the lists of bullet points from a PDF file and put them into
> an XML format.
> The items are indented. I want the text and the indentation level.
> The input is like this:
>     - abc
>     - def
>
>     - xyz
>     - ghi
>
>     - 123
>     - 456
>
>
> Can I convert that to:
> abc
> def
>    xyz
>    ghi
>       123
>       456
> The last step will be to add tags. I have code to do this:
> <abc></abc>
> <def></def>
>     <xyz></xyz>
>     <ghi></ghi>
>         <123></123>
>         <456></456>

This sounds like an ordinary Java question, i.e. parsing some text. PDFBox
does have some rudimentary paragraph detection, but I don't know if it
works. Try the PDFText2HTML tool in the source download.
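
The "parse some text" step could look roughly like the sketch below. It
assumes the text extracted from the PDF keeps each item's indentation as
leading whitespace, which PDF text extraction does not guarantee. The class
IndentLevels, its Item type, and the <item level="..."> output format are
hypothetical names chosen for illustration; none of them are part of PDFBox.
A generic element name is used because names that start with a digit, such
as <123>, are not valid XML element names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of PDFBox: derives nesting levels from leading whitespace.
class IndentLevels {

    /** One list item: its text and its zero-based nesting level. */
    static final class Item {
        public final String text;
        public final int level;

        Item(String text, int level) {
            this.text = text;
            this.level = level;
        }
    }

    /** Assigns a nesting level to every non-blank line, based on its indentation width. */
    static List<Item> parse(List<String> lines) {
        List<Item> items = new ArrayList<>();
        Deque<Integer> indents = new ArrayDeque<>(); // indentation widths of the open levels
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines separate groups but carry no item
            }
            int indent = 0; // count leading whitespace characters (a tab counts as one)
            while (indent < line.length() && Character.isWhitespace(line.charAt(indent))) {
                indent++;
            }
            // Close every level that is indented deeper than the current line.
            while (!indents.isEmpty() && indents.peek() > indent) {
                indents.pop();
            }
            // Open a new level when the line is indented deeper than the current level.
            if (indents.isEmpty() || indents.peek() < indent) {
                indents.push(indent);
            }
            // Assumption: bullets appear as "-", "*" or U+2022 followed by whitespace.
            String text = line.trim().replaceFirst("^[-*\\u2022]\\s+", "");
            items.add(new Item(text, indents.size() - 1));
        }
        return items;
    }

    /** Writes one element per item, indented two spaces per level. */
    static String toXml(List<Item> items) {
        StringBuilder sb = new StringBuilder();
        for (Item item : items) {
            for (int i = 0; i < item.level; i++) {
                sb.append("  ");
            }
            sb.append("<item level=\"").append(item.level).append("\">")
              .append(escape(item.text)).append("</item>\n");
        }
        return sb.toString();
    }

    /** Escapes the characters that are not allowed in XML text content. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Abc", "Def", "     Xyz", "     Ghi", "          123", "          456");
        System.out.print(toXml(parse(lines)));
    }
}
```

For the six sample lines, parse assigns levels 0, 0, 1, 1, 2, 2. The levels
come from a stack of indentation widths rather than a fixed number of spaces
per level, so the sketch does not depend on how wide each indent happens to be.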

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org



