Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
List-Id: "Lucene Users List"
Reply-To: "Lucene Users List"
Date: Tue, 24 Jun 2003 13:53:34 +0400
From: "Nader S. Henein"
Subject: RE: commercial websites powered by Lucene?
To: 'Lucene Users List'
Reply-to: nsh@bayt.net
Message-id: <001801c33a36$7a2cb0a0$1801a8c0@naderit>
Organization: Bayt.com

I have to store the information I am indexing in the database because the nature of our application requires it. When certain columns in a table are updated, I create an XML file, which is then copied to directories on each of my web servers. Separate Lucene apps, running on separate machines, then digest the information into separate indices. You also have to provide procedures that run periodically to ensure that all your indices are in sync with each other and with the DB (I run this once every three days, when CPU usage on the machines is low).

To update the index I have a servlet running off a scheduler in Resin (you could use any web server; Orion's cool too). The upside of distributing your search engines like this is that you have three active backups in case one gets corrupted (hasn't happened in two years), and the load on each machine is pretty low, even during the updates/optimizations every 20 minutes.
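
The periodic sync check mentioned above (each index against the DB) might look roughly like this. It is only a minimal sketch, assuming both sides can be reduced to a set of document IDs; the class and method names are hypothetical, and how the IDs are fetched from the database or from Lucene is left out on purpose.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: compares the document IDs the database says should be
// searchable with the IDs actually present in one index.
final class IndexSyncCheck {

    // IDs present in the database but absent from the index (need re-indexing).
    static Set<String> missingFromIndex(Set<String> dbIds, Set<String> indexIds) {
        Set<String> missing = new TreeSet<>(dbIds); // copy, so the input is not modified
        missing.removeAll(indexIds);
        return missing;
    }

    // IDs still in the index but no longer in the database (need deleting).
    static Set<String> staleInIndex(Set<String> dbIds, Set<String> indexIds) {
        Set<String> stale = new TreeSet<>(indexIds); // copy, so the input is not modified
        stale.removeAll(dbIds);
        return stale;
    }

    public static void main(String[] args) {
        // Made-up IDs, purely for illustration.
        Set<String> dbIds = new TreeSet<>(Arrays.asList("1", "2", "3"));
        Set<String> indexIds = new TreeSet<>(Arrays.asList("2", "3", "4"));
        System.out.println("re-index: " + missingFromIndex(dbIds, indexIds));
        System.out.println("delete:   " + staleInIndex(dbIds, indexIds));
    }
}
```

Running the same comparison once per index, each time against the DB, also covers keeping the indices in sync with each other, since they all converge on the same set of IDs.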

If the server crashes, it's not a problem unless it happens mid-indexing; then you have to somehow remove the write locks created in the index directory (I just delete them, optimize, and restart the update that crashed).

Lucene destroyed Oracle on speed tests. We used to have to use our single DB monster machine for all the searching and indexing, which made the load on it pretty high, but now I have 0.5 loads on all my CPUs and no need to buy new hardware.

-----Original Message-----
From: news [mailto:news@main.gmane.org] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 1:12 PM
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?

So you have a holding table in a database (or directory on disk?) where you store the incoming documents, correct? Does each webserver run its own indexing thread which grabs any new documents every 20 minutes, or is there a central process that manages that? I'm trying to understand how you know when you can safely clean out the holding table.

Did you look at having just a single process that was responsible for updating the index, and then pushing copies out to all the webservers? I'm wondering if that might be worth investigating (since it would take a lot of load off the webservers that are running the searches), or if it will be too troublesome in practice.

Also, I'm interested to see how you handle the situation when a server gets shut down/restarted. Does it just take a copy of the index from one of the other servers (since its own index is likely out of date)? I take it it's not safe to copy an index while it is being updated, so you have to block on that somehow?

PS: It's great to hear Lucene blows Oracle out of the water! I've got some skeptical management that needs some convincing, and hearing stories like this helps a lot :-)

"Nader S. Henein" wrote in message news:000b01c33a2a$d2675290$1801a8c0@naderit...
> I handle updates and inserts the same way: first I delete the document
> from the index and then I insert it (better safe than sorry). I batch
> my updates/inserts every twenty minutes. I would do it in smaller
> intervals, but since I have to sync the XML files created from the DB
> to three machines (I maintain three separate Lucene indices on my
> three separate web servers) it takes a little longer. You have to batch
> your changes because updating the index takes time, as opposed to
> deletes, which I batch every two minutes. You won't have a problem
> updating the index and searching at the same time, because Lucene
> updates the index on a separate set of files and then, when it's done,
> it overwrites the old version. I've had to provide for backups, and
> things like server crashes mid-indexing, but I was using Oracle
> Intermedia before and Lucene BLOWS IT AWAY.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org