Date: Thu, 6 May 2004 23:42:49 -0700 (PDT)
From: Otis Gospodnetic
Subject: Re: Query performance on a 315 Million document index (1TB)
To: Lucene Users List
Message-ID: <20040507064249.56677.qmail@web12704.mail.yahoo.com>
In-Reply-To: <20040506234755.21F8C1CE305@ws3-6.us4.outblaze.com>

That's big, and while I have not created such large indices with Lucene,
I would think that disk I/O would be the biggest issue. That is why Nutch
has distributed search options built in, and their demo has 'only' 100M
documents. Perhaps you can mimic Nutch's distributed indexing and
searching approach.

Otis

--- Will Allen wrote:
> Hi,
> I am considering a project that would index 315+ million documents.
> I am comfortable that the indexing will work well in creating an
> index ~800GB in size, but am concerned about the query performance.
> (Is this a bad assumption?)
>
> What are the bottlenecks of performance as an index scales? Memory?
> Cost is not a concern, so what would be the shortcomings of a
> theoretical machine with 16GB of ram, 4-16 cpus and 1-2 terabytes
> of space? Would it be better to cluster machines to break apart
> the query?
>
> Thank you for your serious responses,
> Will Allen
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
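
For illustration only, the distributed searching idea suggested in the reply could look roughly like the sketch below: the index is split into partitions, each partition is searched independently (so the disk I/O is spread out), and only the best few hits from each partition are merged. This is not Lucene or Nutch code. The names `search_shard` and `search_all` are hypothetical, each partition is faked as an in-memory mapping of document id to score, and the sketch assumes scores are comparable across partitions.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, k):
    """Return the k best (score, doc_id) pairs from one partition, best first.

    `shard` is a hypothetical stand-in for one index partition: a dict of
    doc id -> relevance score for a single query. A real partition would
    be a separate index, ideally on its own disk or machine.
    """
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in shard.items()))


def search_all(shards, k):
    """Query every partition in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, k), shards))
    # At most k hits per partition reach the merge step, so merging stays
    # cheap no matter how large each partition is.
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


# Three tiny made-up partitions with made-up scores.
shards = [
    {"a1": 0.9, "a2": 0.2},
    {"b1": 0.7, "b2": 0.4},
    {"c1": 0.8},
]
top3 = search_all(shards, 3)
```

In a real deployment the per-partition search would be a remote call to another machine rather than a local function, and the scoring statistics would have to be kept consistent across partitions for the merged ranking to be meaningful.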