pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Issues regarding PDFBOX
Date Mon, 29 May 2017 15:25:33 GMT
On 29.05.2017 at 08:56, Kunal Kashyap wrote:
> I am trying to read text from a PDF file using the PDFBox API. I
> want to skip all chart data and images in the output .txt file.
> Can anyone help me with this? I also want to extract the data with
> proper alignment.
> Attached are the sample PDF file and a sample .txt file (the .txt file
> is my desired output).

Please have a look at the ExtractTextByArea.java example in the source 
code download; it shows how to extract text from a predefined area.

There is no way in PDF to "exclude tables" because there is no table 
concept in PDF like in HTML. It's just a bunch of lines with text. You 
would need heuristics to guess what's a table and what isn't.
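
To make "heuristics" a little more concrete, here is a minimal sketch of one possible guess: treat consecutive lines as table-like when they have the same number of text runs (at least two) starting at roughly the same x positions. The Run class, the method names, and the tolerance value are hypothetical and not part of PDFBox; in practice the x positions would have to come from the extracted text positions, and real documents will need more careful rules.

```java
import java.util.Arrays;
import java.util.List;

public class TableGuess {
    /** Hypothetical stand-in for one run of text on a line: left edge and content. */
    public static final class Run {
        public final double x;
        public final String text;

        public Run(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /** True if both lines have the same number of runs (at least 2) and each
     *  pair of runs starts within the given tolerance of the same x position. */
    public static boolean sameColumns(List<Run> a, List<Run> b, double tolerance) {
        if (a.size() < 2 || a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            if (Math.abs(a.get(i).x - b.get(i).x) > tolerance) {
                return false;
            }
        }
        return true;
    }

    /** Flags every line that shares its column layout with the line directly
     *  before or after it. */
    public static boolean[] guessTableLines(List<List<Run>> lines, double tolerance) {
        boolean[] table = new boolean[lines.size()];
        for (int i = 0; i + 1 < lines.size(); i++) {
            if (sameColumns(lines.get(i), lines.get(i + 1), tolerance)) {
                table[i] = true;
                table[i + 1] = true;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Made-up sample data: one paragraph line, two table rows, one paragraph line.
        List<List<Run>> lines = Arrays.asList(
            Arrays.asList(new Run(50, "Quarterly results were strong.")),
            Arrays.asList(new Run(50, "Region"), new Run(200, "Sales"), new Run(300, "Growth")),
            Arrays.asList(new Run(50, "North"), new Run(201, "120"), new Run(299, "4%")),
            Arrays.asList(new Run(50, "Further discussion follows.")));

        boolean[] table = guessTableLines(lines, 3.0);
        for (int i = 0; i < table.length; i++) {
            System.out.println(i + ": " + (table[i] ? "table?" : "text"));
        }
    }
}
```

With the made-up data above, lines 1 and 2 are flagged and lines 0 and 3 are not. A single-row table or a two-column page layout would fool this rule, which is why it can only ever be a guess.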

Regarding the order of the extracted text, call setSortByPosition(true) 
on the text stripper.
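
The region extraction and position sorting suggestions above can be combined roughly as follows. This is a minimal sketch assuming the PDFBox 2.0.x API (PDDocument.load was replaced by a separate loader class in later major versions); the file name "sample.pdf", the region name "body", and the rectangle coordinates are placeholders to adapt, not values taken from the attached sample.

```java
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class ExtractRegionSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder input file and region; adjust both to the real document.
        File input = new File("sample.pdf");
        Rectangle2D region = new Rectangle2D.Double(50, 100, 500, 300);

        // try-with-resources closes the document even if extraction fails.
        try (PDDocument document = PDDocument.load(input)) {
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();

            // Order the text by its position on the page rather than by
            // the order in which it appears in the content stream.
            stripper.setSortByPosition(true);

            // Register the area of interest under a name of our choosing.
            stripper.addRegion("body", region);

            // Extract from the first page only (page indices start at 0).
            PDPage firstPage = document.getPage(0);
            stripper.extractRegions(firstPage);

            System.out.println(stripper.getTextForRegion("body"));
        }
    }
}
```

Choosing a region that covers only the running text is one way to leave charts and their labels out of the output, provided they sit in a predictable part of the page.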

If you want exact positions of everything, have a look at the 
PrintTextLocations.java example.

Tilman

