lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "W. Eliot Kimber" <el...@isogen.com>
Subject Re: indexing PDF files
Date Fri, 03 May 2002 13:59:34 GMT
"Moturu,Praveen" wrote:
> 
> Good Morning to you all. Can I assume none of the poeple on the lucene user
> group had implemented indexing a pdf document using lucene. If some one
> has.. Please help me by providing the solution.

You can try using Eytemon's PJ library (www.eytemon.com). But be aware
that the code as provided does not support some features of PDF and has
some bugs that prevent it from reading some PDFs. 

Note also that there are some inherent problems with full-text indexing
of PDFs, namely that the word order in the PDF does not necessarily
reflect its reading order (for example, in two-column layouts), so if
your tokenizer is doing phrase analysis it may produce incorrect
results. You can see this by doing a multi-word search in Acrobat Reader
on a two-column document. It can also be difficult to accurately
determine word boundaries because of the way that PDF can represent text
strings as sequences of characters and placement instructions. The
Adobe-provided C libraries have largely solved this problem but the PJ
library does not--you will have to write your own algorithms to reduce
text sequences with explicit kerning instructions into meaningful
tokens. Not impossible but takes a little doing. 

If you have money to spend you could license the Adobe PDF libraries and
create a Java binding for them. It does not appear that Adobe has any
plans to provide a Java library for accessing PDFs, free or otherwise.

However, implementing a Java PDF reader would not be too hard--I started
trying to implement one just to see how hard it would be and got as a
far as being able to get page objects by page number after an intense
weekend's work [unfortunately my employment contract prevents me from
creating open-source software without explicit approval and I didn't
want to create a PDF library that wasn't open source, so I haven't done
any more work on it yet]. The PDF spec (www.pdfzone.com) is pretty
clear, although the PDF format is pretty convoluted (lots of byte
offsets and such). But once you get the basic infrastructure in place
for parsing out specific objects, the rest of it is just tedious parser
implementation--there are scads of different field types once you get
down to text streams.

Adding the business logic to figure out where things are on the page
would be more involved--you'd have to implement Adobe's layout logic.
However, you need this functionality in order to correlate PDF
annotations (links, bookmarks, notes) to the page objects they relate
to--it's all done with bounding boxes.

Cheers,

Eliot
-- 
W. Eliot Kimber, eliot@isogen.com
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message