lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: UNIX command-line indexing script?
Date Mon, 15 Mar 2004 10:35:52 GMT
Have a look at the Ant <index> task in the Lucene sandbox.  You're on 
your own, currently, to build this and understand it, but I use it 
frequently.  In fact, the sample index from our book is generated with 
this:

     <index index="${build.dir}/index"
       documenthandler="lia.common.TestDataDocumentHandler">
       <fileset dir="${data.dir}"/>
       <config basedir="${data.dir}"/>
     </index>

You can plug in your own DocumentHandler implementation to index 
different document types however you like.  The default one indexes 
.txt and .html files, but a custom implementation can do its own thing.
Again, writing a DocumentHandler that knows about various document
types is not hard, but you will have to write your own at the moment.
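
To make the plug-in idea concrete, here is a minimal sketch of a handler
that dispatches on file extension. Everything in it is an assumption for
illustration: SimpleDocumentHandler, ExtensionDispatchHandler and
PlainTextHandler are hypothetical names, and the Map of fields is only a
stand-in for a Lucene Document. It does not reproduce the sandbox's
actual DocumentHandler interface, so check the sandbox source for the
real signature before adapting it.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a handler contract: turn one file into named fields.
// This is NOT the sandbox interface; it only illustrates the plug-in shape.
interface SimpleDocumentHandler {
    Map<String, String> toFields(File file);
}

// Picks a per-type handler based on the file extension.
class ExtensionDispatchHandler implements SimpleDocumentHandler {
    private final Map<String, SimpleDocumentHandler> byExtension = new LinkedHashMap<>();

    // Registers a handler for an extension such as "txt" (case-insensitive).
    void register(String extension, SimpleDocumentHandler handler) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), handler);
    }

    // Returns the lower-cased extension, or "" when the name has no dot.
    static String extensionOf(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    @Override
    public Map<String, String> toFields(File file) {
        SimpleDocumentHandler handler = byExtension.get(extensionOf(file));
        if (handler == null) {
            return null; // unknown type: the caller is expected to skip the file
        }
        return handler.toFields(file);
    }
}

// Plain-text handler: records the path and the whole file body (read as UTF-8).
class PlainTextHandler implements SimpleDocumentHandler {
    @Override
    public Map<String, String> toFields(File file) {
        try {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("path", file.getPath());
            fields.put("contents",
                new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8));
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Supporting another document type then means registering one more
handler for its extension; files with no registered handler are skipped.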

Despite the (minor) amount of work you'll have to do to start using 
<index>, the infrastructure adds a lot of value: it is an incremental 
file system indexer (only new docs get indexed on successive runs).  
Plugging this into cron would be trivial.
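
As a rough illustration of what "incremental" can mean here, the sketch
below indexes a file only when it is new or its last-modified time has
changed since the previous run. This is one possible approach under
stated assumptions, not a description of how <index> is implemented:
IncrementalTracker is a hypothetical name, and the in-memory map would
have to be persisted somewhere between cron runs.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: remembers the last-modified time recorded for each path.
class IncrementalTracker {
    private final Map<String, Long> indexedAt = new HashMap<>();

    // True when the path is unseen or its timestamp differs from the recorded one.
    boolean needsIndexing(String path, long lastModified) {
        Long seen = indexedAt.get(path);
        return seen == null || seen != lastModified;
    }

    // Records that the path was indexed at the given timestamp.
    void markIndexed(String path, long lastModified) {
        indexedAt.put(path, lastModified);
    }

    // Returns the candidates that still need indexing and records them as done.
    List<File> selectAndMark(List<File> candidates) {
        List<File> toIndex = new ArrayList<>();
        for (File file : candidates) {
            long modified = file.lastModified();
            if (needsIndexing(file.getPath(), modified)) {
                toIndex.add(file);
                markIndexed(file.getPath(), modified);
            }
        }
        return toIndex;
    }
}
```

On a second run over an unchanged directory, selectAndMark returns an
empty list, which is what makes a frequent cron schedule cheap.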

	Erik

On Mar 13, 2004, at 11:45 AM, Charlie Smith wrote:

> Anyone written a simple UNIX command-line indexing script which will
> read a bunch of different kinds of docs and index them?  I'd like to
> make a cron job out of this so as to be able to come back and read it
> later during a search.
>
> PERL or JAVA script would be fine.
>
>



