lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Litchfield <...@csh.rit.edu>
Subject Re: Investingating Lucene For Project
Date Tue, 01 Mar 2005 21:08:13 GMT

See inlined comments below.

> We have had requests from some clients who would like the ability to
> "index"  PDF files, now and possibly other text files in the future. The
> PDF files live on a server and are in a structured environment. I would
> like to somehow index the content inside the PDF and be able to run
> searches on that information from a web-form. The result MUST BE a text
> snippet (that being some text prior to the searched word and after the
> searched word).  Does this make sense? And can Lucene do this?


Lucene indexes text documents, so you will need to convert your PDF to a
text document.  PDFBox (http://www.pdfbox.org/) can do that, PDFBox
provides a summary of the document, which is just the first x number of
characters.  If you wanted a smarter summary you would need to create that
yourself.

> If the product can do this, how is the best way to get rolling on a
> project of this nature? Purchase an example book, or are there simple
> examples one can pick up on? Does Lucene have a large learning curve? or
> reasonably quick?

There are tutorials available on the website, and I would recommend
the "Lucene in Action" book.  There is a learning curve for lucene, but it
sounds like your requirements are pretty basic so it shouldn't be that
hard.



> If all the above will work, what kind of license does this require? I
> have not been able to find a link to that yet on the jakarta site.

http://www.apache.org/licenses/LICENSE-2.0

Ben

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message