From: Ulrich Mayring
To: lucene-user@jakarta.apache.org
Subject: Re: commercial websites powered by Lucene?
Date: Tue, 24 Jun 2003 12:41:52 +0200

Chris Miller wrote:
>
> The main thing I'm interested in is how you handle updates to Lucene's
> index. I'd imagine you have a fairly high turnover of CVs and jobs, so index
> updates must place a reasonable load on the CPU/disk. Do you keep CVs and
> jobs in the same index or two different ones? And what is the process you
> use to update the index(es) - do you batch-process updates or do you handle
> them in real-time as changes are made?

The way we do it: we periodically re-index everything into a temporary
directory and then rename the temporary directory so that it replaces the
live one. That way the index remains accessible at all times, and how
current it is depends simply on the interval at which I run the re-indexing.

> We need to be able to handle indexing about 60,000 documents/day,
> while allowing (many) searches to continue operating alongside.

On an entry-level Sun I can index about 23 documents per second, and these
are real-life HTML pages. So a complete index run over your 60,000 documents
would finish in less than one hour, and you would save yourself all kinds of
trouble with crashes during indexing etc. On my 2 GHz Linux workstation it's
even faster: more than 2000 documents per minute, so you'd be done in about
half an hour.

BTW, we're not using the supplied JavaCC-based HTML parser; instead we use
htmlparser.sourceforge.net, which is a joy to use and pretty fast.
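
For illustration, the rebuild-then-rename step described above might look
roughly like the sketch below. This is not our production code: the class
name IndexSwap, the IndexBuilder hook and the ".old" suffix are hypothetical,
the actual Lucene indexing is left to the caller, and java.nio.file is
assumed to be available. Note that the two renames are not atomic as a pair,
so a searcher that opens the index in the instant between them would have to
retry.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class IndexSwap {

    /** Hypothetical hook: whatever code writes a complete index into the given directory. */
    interface IndexBuilder {
        void buildInto(Path dir) throws IOException;
    }

    /**
     * Builds a fresh index in a temporary sibling directory, then renames it
     * into place. The temporary directory is created next to the live one so
     * that both renames stay on the same file system.
     * Assumes 'live' has a parent directory (i.e. it is not a file system root).
     */
    static void rebuildAndSwap(Path live, IndexBuilder builder) throws IOException {
        Path parent = live.toAbsolutePath().getParent();
        Path tmp = Files.createTempDirectory(parent, "index-tmp-");

        // The slow part: the live index stays untouched while this runs.
        builder.buildInto(tmp);

        // Move the current index aside (if there is one), then move the new one in.
        Path old = parent.resolve(live.getFileName() + ".old");
        deleteRecursively(old);          // clear leftovers from an earlier, interrupted run
        if (Files.exists(live)) {
            Files.move(live, old);
        }
        Files.move(tmp, live);

        // The previous index is no longer needed.
        deleteRecursively(old);
    }

    /** Deletes a directory tree, children before parents; does nothing if it is absent. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse lexicographic order lists every entry before its parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

A periodic job (cron or a timer thread) would then call
IndexSwap.rebuildAndSwap(liveDir, dir -> runFullIndexInto(dir)), where
runFullIndexInto stands for your own indexing routine. If a run crashes
halfway, only the temporary directory is affected and the live index keeps
serving searches.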
Ulrich