Message-ID: <001501c25ce2$379a7970$1402a8c0@ICEBOXER.local>
From: "John L Cwikla"
To: "Lucene Users List"
References: <20020915131856.11066.qmail@web11905.mail.yahoo.com>
Subject: Re: Abusing lucene and way too many files open
Date: Sun, 15 Sep 2002 11:03:37 -0700
Organization: Biz360

Interestingly, it's really not about the size of the dataset; it's a question of how the data in the dataset is partitioned. There are two reasons for not including the account and language in the record. For the language, we have a different analyzer for each language, so each language requires its own index. For the accounts, it's a question of stability and of how easily we can flush, add, or remove some or all of an account's records, which we do constantly.

Putting the account number in the record instead of having an index per account is doable, and I am leaning toward that solution in the interim, with one index per language. But I worry about the downtime for the other accounts when I need to do a major deletion on one (or some) of the accounts, since they all get locked out during any merging/optimizing phase. Also, if something goes wrong, I'd still like to limit the damage to one account (hopefully) instead of having to reindex 100.

----- Original Message -----
From: "Alex Murzaku"
To: "Lucene Users List"
Sent: Sunday, September 15, 2002 6:18 AM
Subject: Re: Abusing lucene and way too many files open

> > So after roughly crunching some numbers, with a mergeFactor set to
> > the default of 10:
> >
> > (7 files per segment + 12 for fields) * (up to 10 segments) * 100
> > accounts * 10 languages
> >
> > = 200,000 open files at once. Ouch.
>
> I have read accounts of people in this list dealing with data sets
> larger than what you describe. There is one thing that you said you
> were considering: why not include both account and language in the
> record. Unless planning to distribute them over several machines, I
> wouldn't create artificially 100*10 indices. In your case, when
> indexing, I would have instead 14 fields * 7 files * 10 segments open
> files... I guess you already have resolved the problem of identifying
> the language and you have built the corresponding analyzers which you
> call when indexing or querying in a given language. Just a thought.
>
> =====
> alex@lissus.com -- http://www.lissus.com
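
As a rough check on the numbers in this thread, the sketch below applies the per-index formula from the quoted estimate, (7 files per segment + 12 for fields) * (up to 10 segments), to the two layouts being discussed: one index per account and language, and one index per language. The class and method names are hypothetical, and the counts are only the worst-case figures quoted above; the real number of open files depends on the Lucene version and on the state of each index. Note that the exact product for the first layout is 190,000, which the estimate rounds to 200,000.

```java
public class OpenFileEstimate {

    // Worst-case open files for a single index, following the formula in the
    // quoted estimate: (files per segment + per-field files) * segments.
    public static long filesPerIndex(int filesPerSegment, int fieldFiles, int maxSegments) {
        return (long) (filesPerSegment + fieldFiles) * maxSegments;
    }

    // Worst-case open files when indexCount such indices are open at once.
    public static long openFiles(int filesPerSegment, int fieldFiles,
                                 int maxSegments, int indexCount) {
        return filesPerIndex(filesPerSegment, fieldFiles, maxSegments) * indexCount;
    }

    public static void main(String[] args) {
        // Figures taken from the thread; treat them as assumptions.
        int filesPerSegment = 7;
        int fieldFiles = 12;
        int maxSegments = 10;   // mergeFactor default of 10
        int accounts = 100;
        int languages = 10;

        // Layout 1: one index per (account, language) pair.
        long perAccountAndLanguage =
                openFiles(filesPerSegment, fieldFiles, maxSegments, accounts * languages);

        // Layout 2: one index per language, account stored in the record.
        // The field count is kept at 12 here for simplicity; storing the
        // account would add a field and raise the figure slightly.
        long perLanguageOnly =
                openFiles(filesPerSegment, fieldFiles, maxSegments, languages);

        System.out.println("index per account and language: " + perAccountAndLanguage);
        System.out.println("index per language only:        " + perLanguageOnly);
    }
}
```

Under these assumptions the per-language layout needs about one hundredth of the file handles, which is the trade-off against the isolation concerns raised in the reply.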