lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: How to index and search PDF documents.
Date Wed, 06 Apr 2005 07:45:59 GMT

On Apr 6, 2005, at 3:07 AM, <himani.tandon@wipro.com> wrote:
> "THAT is  HOW CAN I INDEX  and SEARCH .pdf, .ppt,. xml, .doc etc 
> DOCUMENTS  WITH LUCENE."
> I WILL BE REALLY HANKFUL IF U SOLVE MY PROBLEM.

<commercial>

Get a copy of Lucene in Action.  Otis wrote a great chapter on how to 
handle various document formats with Lucene:

	http://www.lucenebook.com/search?query=index+pdf

</commercial>

It sounds like you're using the Lucene demo application.  That is a 
reasonable starting point, but you will very quickly want to diverge 
from it and integrate in a PDF reader (PDFBox being the best open 
source library) and something to read Office documents (TextMining for 
Word docs, perhaps POI for the other formats).

Also, have a look at Nutch - it may actually be the indexing/searching 
solution you want for your intranet rather than a custom Lucene 
application.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message