lucene-java-user mailing list archives

From "Runde, Kevin" <>
Subject RE: Investingating Lucene For Project
Date Tue, 01 Mar 2005 21:20:51 GMT
Also there is a book called "Lucene in Action" that was released
recently. It is a great introduction to Lucene and has sections
dedicated to indexing different text document types (txt, html, pdf,
doc, rtf). FYI I am in no way related to the book or the authors so this
is a real recommendation. It will help you quickly learn what Lucene is
and can do. It has lots of pointers to other projects that use Lucene or
expand upon its functionality.


-----Original Message-----
From: Ben Litchfield [] 
Sent: Tuesday, March 01, 2005 3:08 PM
To: Lucene Users List
Subject: Re: Investingating Lucene For Project 

See inlined comments below.

> We have had requests from some clients who would like the ability to
> "index"  PDF files, now and possibly other text files in the future.
> PDF files live on a server and are in a structured environment. I
> would like to somehow index the content inside the PDF and be able to run
> searches on that information from a web-form. The result MUST BE a
> snippet (that being some text prior to the searched word and after the
> searched word).  Does this make sense? And can Lucene do this?

Lucene indexes text documents, so you will need to convert your PDF to a
text document.  PDFBox can do that.  PDFBox
provides a summary of the document, which is just the first x number of
characters.  If you wanted a smarter summary you would need to create
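
As a rough illustration of the "snippet" requirement from the question (some
text before and after the searched word), here is a minimal sketch in plain
Java.  It assumes the PDF text has already been extracted into a String.  The
class name SnippetSketch, the method snippet, and the context parameter are
hypothetical names made up for this example; they are not part of Lucene or
PDFBox.

```java
public class SnippetSketch {

    // Finds the first case-insensitive occurrence of term in text, or -1.
    static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns up to `context` characters on each side of the first hit,
    // with "..." marking where the surrounding text was cut off.
    // Returns null when the term does not occur in the text.
    static String snippet(String text, String term, int context) {
        int hit = indexOfIgnoreCase(text, term);
        if (hit < 0) {
            return null;
        }
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), hit + term.length() + context);
        String prefix = start > 0 ? "..." : "";
        String suffix = end < text.length() ? "..." : "";
        return prefix + text.substring(start, end) + suffix;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF.
        String text = "Lucene indexes text documents, so a PDF must be "
                + "converted to text before it can be indexed.";
        System.out.println(snippet(text, "pdf", 20));
    }
}
```

This only does a plain substring match on the raw text.  It does not use the
Lucene index or analyzer, so it will not line up with stemmed or tokenized
query terms; treat it as a starting point rather than a finished summarizer.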

> If the product can do this, what is the best way to get rolling on a
> project of this nature? Purchase an example book, or are there simple
> examples one can pick up on? Does Lucene have a large learning curve?
> reasonably quick?

There are tutorials available on the website, and I would recommend
the "Lucene in Action" book.  There is a learning curve for Lucene, but
it sounds like your requirements are pretty basic so it shouldn't be that

> If all the above will work, what kind of license does this require? I
> have not been able to find a link to that yet on the jakarta site.


To unsubscribe, e-mail:
For additional commands, e-mail:

