lucene-java-user mailing list archives

From "Nader S. Henein" <...@bayt.net>
Subject RE: commercial websites powered by Lucene?
Date Tue, 24 Jun 2003 11:19:02 GMT
We thought of that in the beginning, but then we became more comfortable
with multiple indices for simple backup purposes. Our indices are now in
excess of 100 MB, and transferring that kind of data between three
machines sitting in the same data center is manageable, but once you
start thinking of distributed webservers in different hosting
facilities, copying 100 MB every 20 minutes, or even every hour,
becomes financially expensive.

Our webservers are single-processor Sun UltraSPARC III 400 MHz machines
with two gigs of memory, and I've never seen the CPU usage go over 0.8 at
peak time with the indexer running. Try it out first, and take your time
to gather your own numbers so you can really get a feel for what setup
fits you best.

Nader



-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 2:58 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?


Thanks David, that's about what I figured. Of course if the servers are
pulling the information then a central holding table that contains only
new data doesn't make much sense anymore. Instead I guess the easiest
approach would be to have a central table that contains the entire
dataset, and has last-modified timestamps on each record so the
individual webservers can grab just the data that was changed since they
last ran an index update. My concern is still that the effort of indexing
(which is potentially quite high) is being duplicated across all the
webservers.
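
To make that concrete, the pull on each webserver could look roughly like
the sketch below. It's untested and only meant to illustrate the idea: the
table, column and Lucene field names are invented, and the
IndexReader.delete(Term) and Field.Keyword/Field.Text calls are the Lucene
1.x style, so adjust for whatever version you're running.

// Rough, untested sketch of the per-webserver incremental update.
// The documents table, its columns and the Lucene field names are all
// invented for illustration; Field.Keyword/Field.Text and
// IndexReader.delete(Term) are the Lucene 1.x calls.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IncrementalIndexer {

    public void update(Connection conn, String indexPath, Timestamp lastRun)
            throws Exception {
        // Grab only the rows that changed since this server's last run.
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id, title, body FROM documents WHERE last_modified > ?");
        ps.setTimestamp(1, lastRun);
        ResultSet rs = ps.executeQuery();

        // First pass: remember the changed rows and delete their old copies
        // from the index (deletes go through IndexReader in Lucene 1.x).
        IndexReader reader = IndexReader.open(indexPath);
        List changed = new ArrayList();
        while (rs.next()) {
            String[] row = { rs.getString("id"),
                             rs.getString("title"),
                             rs.getString("body") };
            reader.delete(new Term("id", row[0]));
            changed.add(row);
        }
        reader.close();
        rs.close();
        ps.close();

        // Second pass: re-add the changed rows ("false" means open the
        // existing index rather than creating a new one).
        IndexWriter writer =
            new IndexWriter(indexPath, new StandardAnalyzer(), false);
        for (int i = 0; i < changed.size(); i++) {
            String[] row = (String[]) changed.get(i);
            Document doc = new Document();
            doc.add(Field.Keyword("id", row[0]));
            doc.add(Field.Text("title", row[1]));
            doc.add(Field.Text("body", row[2]));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }
}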

Is there any reason why it would be a bad idea to have one machine
responsible for grabbing updates and adding documents to a master index,
so the other servers could periodically grab a copy of that index and
hot-swap it with their previous copy? Is Lucene capable of handling that
scenario? It seems to me that this approach would reduce the stress on the
webservers even more, and even if the indexing server went down the
webservers would still have a stale index to search against. Has anyone
attempted something like this?
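
For what it's worth, the swap itself could be as simple as the rough sketch
below (untested; the IndexSearcher(String) constructor is the Lucene 1.x
style, and getting the fresh copy onto the box via rsync, scp or whatever
is a separate question):

// Rough, untested sketch of hot-swapping the index on a webserver.
// Assumes the fresh index has already been copied in full into
// newIndexDir before swap() is called; IndexSearcher(String) is the
// Lucene 1.x constructor.
import org.apache.lucene.search.IndexSearcher;

public class SwappableSearcher {

    private IndexSearcher current;

    // All queries go through this, so they always see whichever copy is live.
    public synchronized IndexSearcher getSearcher() {
        return current;
    }

    public void swap(String newIndexDir) throws Exception {
        IndexSearcher fresh = new IndexSearcher(newIndexDir);
        IndexSearcher old;
        synchronized (this) {
            old = current;
            current = fresh;
        }
        if (old != null) {
            old.close();
        }
    }
}

The main wrinkle is that searches already running against the old copy need
it to stay open until they finish, so in practice you'd probably delay the
close() rather than doing it immediately.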


"David Medinets" <medined@mtolive.com> wrote in message
news:059601c33a3d$423547f0$6722a8c0@medined01...
> ----- Original Message -----
> From: "Chris Miller" <chris_overseas@hotmail.com>
> > Did you look at having just a single process that was responsible
> > for updating the index, and then pushing copies out to all the
> > webservers? I'm wondering if that might be worth investigating (since
> > it would take a lot of load off the webservers that are running the
> > searches), or if it will be too troublesome in practice.
>
> I've found that pulling information from a central source is simpler
> than pushing information. When information is pushed, there is much
> administration on the central server to track the recipient machines.
> It seems like servers are always being added to and dropped from the
> push list. Additionally, you need to account for servers that stop
> responding. When information is pulled from the central source, these
> issues of coordination are eliminated.
>
> David Medinets
> http://www.codebits.com




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org



