From: James Pine
Date: Thu, 6 Jul 2006 12:09:17 -0700 (PDT)
To: java-user@lucene.apache.org
Subject: RE: Managing a large archival (and constantly changing) database
Hey,

I found this thread to be very useful when deciding on an indexing strategy:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12700.html

The system I work on has 3 million or so documents and, until a non-Lucene performance issue came up, it was set up to add/delete new documents every 15 minutes in a similar manner to what's described in the thread. During peak traffic we were adding/deleting a few thousand documents every 15 minutes. We have a dedicated indexing machine and distribute portions of our index across multiple machines, but you could still follow the same pattern all on one box, just with separate processes/threads. Even though Lucene allows certain types of index operations to happen concurrently with search activity, IMHO, if you can decouple the indexing process from the searching process, your system as a whole will be more flexible and scalable, with only a little extra maintenance overhead.

JAMES

--- Larry Ogrodnek wrote:

> We have a similar setup, although probably only 1/5th the number of
> documents and updates. I'd suggest just making periodic index backups.
>
> I've been storing my index as follows:
>
> //data/ (lucene index directory)
> //backups/
>
> The "data" is what's passed into IndexWriter/IndexReader. Additionally,
> I create/update a .last_update file, which just contains the timestamp
> of when the last update was started, so when the app starts up it only
> needs to retrieve updates from the db since then.
>
> Periodically the app copies the contents of data into a new directory in
> backups named by the date/time, e.g. backups/2007-07-04.110051.
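[Editor's sketch: the backup-plus-marker scheme described above (timestamped copies of the index directory, plus a `.last_update` file recording when the last update started) could look roughly like the following in plain Java. The `yyyy-MM-dd.HHmmss` naming follows the example in the message; the class name, method names, and keeping `.last_update` inside the data directory are assumptions.]

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// One possible implementation of the backup scheme: copy the live
// "data" index directory into backups/<timestamp>, and keep a
// .last_update marker so a restarted app only pulls records from the
// database that changed after the last indexing run.
public class IndexBackup {
    // Timestamp format taken from the example: backups/2007-07-04.110051
    static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyy-MM-dd.HHmmss");

    // Copy every file in dataDir into a new timestamped backup directory.
    // Lucene index directories are flat, so a shallow copy suffices.
    public static Path backup(Path dataDir, Path backupRoot) throws IOException {
        Path dest = backupRoot.resolve(LocalDateTime.now().format(STAMP));
        Files.createDirectories(dest);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dataDir)) {
            for (Path f : files) {
                Files.copy(f, dest.resolve(f.getFileName()),
                           StandardCopyOption.COPY_ATTRIBUTES);
            }
        }
        return dest;
    }

    // Record the start time of the most recent update. Because the marker
    // lives inside dataDir, it is carried along into every backup, which
    // is what makes restore-then-catch-up work.
    public static void writeLastUpdate(Path dataDir, long epochMillis) throws IOException {
        Files.write(dataDir.resolve(".last_update"),
                    Long.toString(epochMillis).getBytes());
    }

    public static long readLastUpdate(Path dataDir) throws IOException {
        return Long.parseLong(new String(
            Files.readAllBytes(dataDir.resolve(".last_update"))).trim());
    }
}
```

To restore, you would replace the contents of `data` with the newest backup directory and re-fetch database records newer than the restored `.last_update` value, exactly as the message goes on to describe.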
> If needed, I can delete data and replace the contents with the latest
> backup, and the app will only retrieve records updated since the backup
> was made (using the backup's .last_update)...
>
> I'd recommend making complete index creation from scratch a normal
> operation as much as possible (but you're right, for that number of
> documents it will take awhile). It's been really helpful here when
> doing additional deploys for testing, or deciding we want to index
> things differently, etc...
>
> -larry
>
> -----Original Message-----
> From: Scott Smith [mailto:ssmith@mainstreamdata.com]
> Sent: Thursday, July 06, 2006 1:48 PM
> To: lucene-user@jakarta.apache.org
> Subject: Managing a large archival (and constantly changing) database
>
> I've been asked to do a project which provides full-text search for a
> large database of articles. The expectation is that most of the
> articles are fairly small (<2k bytes). There will be an initial
> population of around 400,000 articles. There will then be approximately
> 2,000 new articles added each day; they need to be added in "real time"
> (within a few minutes of arrival), but will be spread out during the
> day. So, roughly another 700,000 articles each year.
>
> I've read enough to believe that having a Lucene database of several
> million articles is doable, and adding 2,000 articles per day wouldn't
> seem to be that many. My concern is the real-time nature of the
> application. I'm a bit nervous (perhaps without justification) about
> simply growing one monolithic Lucene database. Should there be a crash,
> the database will be unusable and I'll have to rebuild from scratch
> (which, based on my experience, would be hours of time).
>
> Some of my thoughts were:
>
> 1) Having monthly databases and using MultiSearcher to search across
> them. That way my exposure to a corrupted database is limited to this
> month's database.
> This would also seem to give me somewhat better control, meaning:
> a) if a search was generating lots of hits, I could display the
> results a month at a time and not bury the user with output. It would
> also spread the search CPU load out better and not prevent other
> individuals from doing a search. If there were very few results, I
> could sleep between each month's search and, again, not lock everyone
> else out from searching.
>
> 2) Have a "this month's" searchable and an "everything else"
> searchable. At the beginning of each month, I would consolidate the
> previous month's database into the "everything else" searchable. This
> would give more consistent results for relevancy-ranked searches, but
> it means that a bad search could return lots of results.
>
> Has anyone else dealt with a similar problem? Am I expecting too much
> from Lucene running on a single machine (or should I be looking at
> Hadoop?)? Any comments or links to previous discussions on this topic
> would be appreciated.
>
> Scott

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
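[Editor's sketch: Scott's option 1 comes down to routing each search to the per-month index directories that overlap the query's date range, then searching them together. The directory-selection half can be shown in plain Java; the `yyyy-MM` directory naming and the class/method names are assumptions, not anything from the thread.]

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Map a date range onto the monthly index directories that cover it.
// A corrupted index then costs at most one month's worth of documents,
// and a rebuild only has to touch that one directory.
public class MonthlyIndexes {
    public static List<String> indexDirsFor(YearMonth from, YearMonth to, String root) {
        List<String> dirs = new ArrayList<>();
        for (YearMonth m = from; !m.isAfter(to); m = m.plusMonths(1)) {
            dirs.add(root + "/" + m);  // YearMonth.toString() yields "yyyy-MM"
        }
        return dirs;
    }
}
```

Each returned path would back its own IndexSearcher, and the searchers would be combined with Lucene's MultiSearcher (which takes an array of Searchables) so a single query spans all selected months, as option 1 proposes.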