pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: extract bullet points from a PDF
Date Thu, 29 Sep 2016 18:43:39 GMT
On 29.09.2016 at 15:08, win harrington wrote:
> I would like to extract all the lists of bullet points from a PDF file and put them into
> an XML format.
> The items are indented. I want the text and the indentation level.
> The input is like this:
>     - abc
>     - def
>     
>     - xyz
>     - ghi
>     
>     - 123
>     - 456
>
>
> Can I convert that to:
> abc
> def
>     xyz
>     ghi
>         123
>         456
> The last step will be to add tags. I have code to do this:
> <abc></abc>
> <def></def>
>     <xyz></xyz>
>     <ghi></ghi>
>         <123></123>
>         <456></456>

This sounds like an ordinary Java question, i.e. parsing some text. PDFBox 
does have some rudimentary paragraph detection, but I don't know if it 
works. Try the PDFText2HTML tool in the source download.
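
For the "parse some text" part, here is a minimal sketch in plain Java. It assumes the text has already been extracted and that each bullet line still carries its indentation as leading whitespace, which the default PDFBox text extraction may not give you. The class BulletLevels and its methods are hypothetical names for illustration, not part of PDFBox.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of PDFBox: turns "- item" lines with
// leading whitespace into (level, text) pairs.
class BulletLevels {

    /** One bullet item: its nesting level (0 = outermost) and its text. */
    static final class Item {
        final int level;
        final String text;

        Item(int level, String text) {
            this.level = level;
            this.text = text;
        }
    }

    /** Assumes every bullet starts with '-' and indentation is leading whitespace. */
    static List<Item> parse(List<String> lines) {
        // Stack of indent widths, one entry per currently open level.
        Deque<Integer> indents = new ArrayDeque<>();
        List<Item> items = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("-")) {
                continue; // skip blank lines and anything that is not a bullet
            }
            int indent = line.indexOf('-'); // column of the bullet marker
            // Close levels that are indented further than this line.
            while (!indents.isEmpty() && indents.peek() > indent) {
                indents.pop();
            }
            // Open a new level if this line is indented further than the current one.
            if (indents.isEmpty() || indents.peek() < indent) {
                indents.push(indent);
            }
            items.add(new Item(indents.size() - 1, trimmed.substring(1).trim()));
        }
        return items;
    }

    /** Writes one item per line, four spaces per level. */
    static String render(List<Item> items) {
        StringBuilder sb = new StringBuilder();
        for (Item item : items) {
            for (int i = 0; i < item.level; i++) {
                sb.append("    ");
            }
            sb.append(item.text).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "- abc", "- def", "",
                "    - xyz", "    - ghi", "",
                "        - 123", "        - 456");
        System.out.print(render(parse(input)));
    }
}
```

The level is derived from the relative indent widths rather than a fixed number of spaces per level, so it does not matter how wide each indentation step is. If the extracted text has lost its leading whitespace, the indent would have to come from somewhere else (for example the x position of the text on the page), and only the level logic above would carry over.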
