lucene-java-user mailing list archives

From "Nader S. Henein" <...@bayt.net>
Subject RE: commercial websites powered by Lucene?
Date Tue, 24 Jun 2003 11:10:40 GMT
Bear in mind that indexing speed also depends on the type of files
you're working with: indexing simple HTML files will be faster than
indexing, say, a 30-field XML file with date fields. For us a full
re-index of 500k documents takes about 4 hours, so periodic full
re-indexing would put too high a load on the system.


-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Ulrich Mayring
Sent: Tuesday, June 24, 2003 2:42 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


Chris Miller wrote:
> 
> The main thing I'm interested in is how you handle updates to Lucene's
> index. I'd imagine you have a fairly high turnover of CVs and jobs, so
> index updates must place a reasonable load on the CPU/disk. Do you 
> keep CVs and jobs in the same index or two different ones? And what is
> the process you use to update the index(es) - do you batch-process 
> updates or do you handle them in real-time as changes are made?

The way we do it: we re-index everything periodically in a temporary 
directory and then rename the temporary directory. That way the index 
remains accessible at all times and its currency is simply determined by
the interval I run the re-indexing in.
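
A minimal sketch of that rebuild-and-rename approach, using only
java.nio.file. The class IndexSwap, the IndexBuilder hook and the ".old"
suffix are hypothetical names chosen for illustration; they are not part
of Lucene, and the actual indexing step is left to the caller. It
assumes the temporary and live directories sit on the same filesystem so
that the rename is cheap.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class IndexSwap {

    /** Hypothetical hook: writes a complete, fresh index into the given directory. */
    interface IndexBuilder {
        void buildInto(Path dir) throws IOException;
    }

    /** Builds a new index next to the live one, then swaps it in by renaming. */
    static void rebuildAndSwap(Path live, IndexBuilder builder) throws IOException {
        Path parent = live.toAbsolutePath().getParent();

        // Build in a sibling directory so the final rename stays on one filesystem.
        Path fresh = Files.createTempDirectory(parent, "index-new-");
        builder.buildInto(fresh);

        // Move the current index aside, then rename the fresh one into place.
        // Between the two moves the live path briefly does not exist.
        Path old = parent.resolve(live.getFileName() + ".old");
        deleteRecursively(old);
        if (Files.exists(live)) {
            Files.move(live, old, StandardCopyOption.ATOMIC_MOVE);
        }
        Files.move(fresh, live, StandardCopyOption.ATOMIC_MOVE);
        deleteRecursively(old);
    }

    /** Deletes a directory tree, children before parents; no-op if it is absent. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Searchers that already hold the old index open may need to be reopened
after the swap, and on some platforms open files can prevent the old
directory from being renamed or deleted, so treat this as a starting
point rather than a drop-in solution.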

>  We need to be able to handle indexing about 60,000 documents/day, 
> while allowing (many) searches to continue operating alongside.

On an entry-level Sun I can index about 23 documents per second and 
these are real-life HTML pages. Thus in less than one hour you would be 
finished with a complete index run and save yourself all kinds of 
trouble with crashes during indexing etc.

On my 2 GHz Linux workstation it's even faster: more than 2000 documents
per minute, so you'd be done in half an hour.
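
The arithmetic behind those two estimates can be checked with a few
lines of Java. This is only a back-of-the-envelope sketch: it assumes a
full run covers 60,000 documents (the figure from the question) and that
throughput stays constant for the whole run. The class and method names
are made up for illustration.

```java
final class IndexingEstimate {

    /** Minutes needed to index the given number of documents at a steady rate. */
    static double minutesToIndex(long documents, double docsPerSecond) {
        return documents / docsPerSecond / 60.0;
    }

    public static void main(String[] args) {
        long documents = 60_000;

        // Entry-level Sun: about 23 documents per second.
        System.out.printf("Sun:   %.1f minutes%n", minutesToIndex(documents, 23.0));

        // 2 GHz Linux workstation: about 2000 documents per minute.
        System.out.printf("Linux: %.1f minutes%n", minutesToIndex(documents, 2000 / 60.0));
    }
}
```

That works out to roughly 43 minutes on the Sun and 30 minutes on the
Linux box, consistent with "less than one hour" and "half an hour".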

BTW, we're not using the supplied JavaCC-based HTML parser; instead we 
use htmlparser.sourceforge.net, which is a joy to use and pretty fast.

Ulrich



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
