pdfbox-users mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Extracting layout information and text from searchable PDF
Date Tue, 28 Feb 2017 00:46:33 GMT
Might be relevant:

https://github.com/JonathanLink/PDFLayoutTextStripper

This might be helpful:
 
https://github.com/apache/tika/pull/152

If you want to extract tables, take a look at Tabula:
http://tabula.technology/


-----Original Message-----
From: viraf.bankwalla@yahoo.com.INVALID [mailto:viraf.bankwalla@yahoo.com.INVALID] 
Sent: Monday, February 27, 2017 1:36 PM
To: users@pdfbox.apache.org
Subject: Extracting layout information and text from searchable PDF

I have a number of searchable PDF documents from which I want to extract layout information
and text. These documents are mixed: some pages are structured (e.g. forms), while
others are unstructured free-form text (e.g. letters, reports, etc.). I was wondering whether
there are any projects that provide such capabilities. I am familiar with PdfTextExtractor,
and it would probably be a starting point if I were to build this functionality out.
Thanks
- viraf

