Return-Path: Delivered-To: apmail-jakarta-lucene-user-archive@apache.org Received: (qmail 98830 invoked from network); 16 Sep 2002 18:11:47 -0000 Received: from unknown (HELO nagoya.betaversion.org) (192.18.49.131) by daedalus.apache.org with SMTP; 16 Sep 2002 18:11:47 -0000 Received: (qmail 16151 invoked by uid 97); 16 Sep 2002 18:12:15 -0000 Delivered-To: qmlist-jakarta-archive-lucene-user@jakarta.apache.org Received: (qmail 16074 invoked by uid 97); 16 Sep 2002 18:12:13 -0000 Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm Precedence: bulk List-Unsubscribe: List-Subscribe: List-Help: List-Post: List-Id: "Lucene Users List" Reply-To: "Lucene Users List" Delivered-To: mailing list lucene-user@jakarta.apache.org Received: (qmail 16062 invoked by uid 98); 16 Sep 2002 18:12:12 -0000 X-Antivirus: nagoya (v4218 created Aug 14 2002) Message-ID: From: John Cwikla To: 'Lucene Users List' Subject: RE: Abusing lucene and way too many files open Date: Mon, 16 Sep 2002 11:11:30 -0700 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2653.19) Content-Type: text/plain; charset="iso-8859-1" X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N Hmm, I didn't know this was possible, thanks for the info on this one. That would take me down to only 20000 files open :) But that is doable. -----Original Message----- From: Alex Murzaku [mailto:murzaku@yahoo.com] Sent: Sunday, September 15, 2002 12:29 PM To: Lucene Users List Subject: Re: Abusing lucene and way too many files open I haven't played with this, but I was under the impression that you only need to worry that same data are indexed and queried using the same analyzer. If the language is always included in the query (or indexing) the data would be virtually partitioned by language. As for the analyzer, you just need to know before submitting the query or indexing what is the language you will be using and calling the corresponding analyzer. Now, I don't want to get into too much detail because I haven't tried this myself. It just seems logical... --- John L Cwikla wrote: > Interestingly, it's really not about the size of the dataset, it's a > question > of partitioning the data in the dataset. > > There are two reasons for not including the account and language in > the > record. For the language, we have different analyzers for each > language > so they require their own index. For the accounts, it's a question > of > stability > and ease of flushing/adding/removing some or all the records, > constantly. > Putting the account number in the record instead of having an index > per > account > is doable, and I am leaning toward that solution in the interim, > having 1 > index > per language, but I worry about > the downtime to other accounts when I need to do some major deletion > to > one (or some) of the accounts as they get locked out during any > merging/optimizing > phase, and also if something goes wrong I'd still like to limit it to > 1 > account (hopefully) > instead of having to reindex 100. > > > ----- Original Message ----- > From: "Alex Murzaku" > To: "Lucene Users List" > Sent: Sunday, September 15, 2002 6:18 AM > Subject: Re: Abusing lucene and way too many files open > > > > > So after roughly crunching some numbers, with a mergeFactor set > to > > > the default of 10: > > > > > > (7 files per segment + 12 for fields) * (up to 10 segments) * 100 > > > accounts * 10 langauges > > > > > > = 200,000 open files at once. Ouch. > > > > I have read accounts of people in this list dealing with data sets > > larger than what you describe. There is one thing that you said you > > were considering: why not include both account and language in the > > record. Unless planning to distribute them over several machines, I > > wouldn't create artificially 100*10 indices. In you case, when > > indexing, I would have instead 14 fields * 7 files * 10 segments > open > > files... I guess you already have resolved the problem of > identifying > > the language and you have built the corresponding analyzers which > you > > call when indexing or querying in a given language. Just a thought. > > > > > > ===== > > __________________________________ > > alex@lissus.com -- http://www.lissus.com > > > > __________________________________________________ > > Do you Yahoo!? > > Yahoo! News - Today's headlines > > http://news.yahoo.com > > > > -- > > To unsubscribe, e-mail: > > > For additional commands, e-mail: > > > > > > > -- > To unsubscribe, e-mail: > > For additional commands, e-mail: > > __________________________________________________ Do you Yahoo!? Yahoo! News - Today's headlines http://news.yahoo.com -- To unsubscribe, e-mail: For additional commands, e-mail: -- To unsubscribe, e-mail: For additional commands, e-mail: