lucene-java-user mailing list archives

From "Mike O'Leary" <tm-ole...@comcast.net>
Subject Extracting formatted text from PDF files
Date Thu, 22 Mar 2007 18:08:01 GMT
Please forgive the laziness inherent in this question, as I haven't looked
through the PDFBox code yet. I am wondering whether that code supports
extracting text from PDF files while preserving such things as runs of
whitespace between characters and other layout and formatting information.
I am working on a project that extracts and operates on certain table-like
blocks of text from PDF files, and many freeware and shareware PDF-to-text
converters seem either to ignore formatting or to try to preserve it without
getting it quite right. Does PDFBox provide better support for this kind of
thing? Thanks.

Mike O'Leary

