lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nader S. Henein" <>
Subject RE: commercial websites powered by Lucene?
Date Tue, 24 Jun 2003 09:53:34 GMT
I have to store the information I am indexing in the database because
the nature of our application requires it, on update of certain columns
in a table I create an XML file which is then copied to directories on
each of my web servers, then separate lucene apps, running on separates
machines digest the information into separate indices, you also have to
provide procedures that will run periodically to ensure that all you
indices are in sync with each other and in sync with the DB ( I run this
once every three days when the CPU usage on the machines is low) 

To update the index I have a servlet running off a scheduler in Resin
(you could use any webserver, Orion's cool too), the up-side to
distributing your search engines like this is that you have three active
back ups in case one got corrupted (hasn't happened in two years), and
the load on each machine is pretty low even during updates/optimizations
every 20 minutes.

If the server crashes, it's not  a problem unless it happens
mid-indexing, then you have to somehow remove the write locks created in
the index directory ( I just delete them, optimize, and re-start the
update that crashed) 

Lucene destroyed Oracle on speed tests and we use to have to use our
single DB monster machine for all the searching and indexing which made
the load on it pretty high, but now I have 0.5 loads on all my CPUs and
no need to buy new hardware

-----Original Message-----
From: news [] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 1:12 PM
Subject: Re: commercial websites powered by Lucene?

So you have a holding table in a database (or directory on disk?) where
you store the incoming documents correct? Does each webserver run it's
own indexing thread which grabs any new documents every 20 minutes, or
is there a central process that manages that? I'm trying to understand
how you know when you can safely clean out the holding table.

Did you look at having just a single process that was responsible for
updating the index, and then pushing copies out to all the webservers?
I'm wondering if that might be worth investigating (since it would take
a lot of load off the webservers that are running the searches), or if
it will be too troublesome in practice.

Also, I'm interested to see how you handle the situation when a server
gets shutdown/restarted - does it just take a copy of the index from one
of the other servers (since it's own index is likely out of date)? I
take it it's not safe to copy an index while it is being updated, so you
have to block on that somehow?

PS: It's great to hear Lucene blows Oracle out of the water! I've got
some skeptical management that need some convincing, hearing stories
like this helps a lot :-)

"Nader S. Henein" <> wrote in message
> I handle updates or inserts the same way first I delete the document 
> from the index and then I insert it (better safe than sorry), I batch 
> my updates/inserts every twenty minutes, I would do it in smaller 
> intervals but since I have to sync the XML files created from the DB 
> to three machines (I maintain three separate Lucene indices on my 
> three separate
> web-servers) it takes a little longer. You have to batch your changes
> because Updating the index takes time as opposed to deleted which I
> batch every two minutes. You won't have a problem updating the index
> searching at the same time because lucene updates the index on a
> separate set of files and then when It's done it overwrites the old
> version. I've had to provide for Backups, and things like server
> mid-indexing, but I was using Oracle Intermedia before and Lucene

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message