Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <001501c25ce2$379a7970$1402a8c0@ICEBOXER.local>
From: "John L Cwikla" <cwikla@biz360.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
References: <20020915131856.11066.qmail@web11905.mail.yahoo.com>
Subject: Re: Abusing lucene and way too many files open
Date: Sun, 15 Sep 2002 11:03:37 -0700
Organization: Biz360
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

Interestingly, it's really not about the size of the dataset; it's a
question of how the data in the dataset is partitioned.

There are two reasons for not including the account and language in the
record. For the language, we have a different analyzer for each language,
so each language requires its own index. For the accounts, it's a
question of stability and of how easily we can flush, add, or remove some
or all of the records, which we do constantly.

Putting the account number in the record instead of having an index per
account is doable, and I am leaning toward that solution in the interim,
with one index per language. But I worry about the downtime for the other
accounts when I need to do a major deletion in one (or some) of the
accounts, since they all get locked out during any merging/optimizing
phase. Also, if something goes wrong, I'd still like to limit the damage
to one account (hopefully) instead of having to reindex 100.
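
For comparison, here is a rough sketch of the open-file arithmetic for the
two layouts. It only reuses the figures quoted below (7 files per segment,
12 per-field files, up to 10 segments with the default mergeFactor); the
class and method names are made up for illustration, and the result is an
upper-bound estimate, not a measurement.

```java
// Rough upper bound on simultaneously open index files, using the
// per-segment figures quoted in this thread (assumed, not measured).
public class OpenFileEstimate {
    static final int FILES_PER_SEGMENT = 7;  // fixed files per segment (from the thread)
    static final int FIELD_FILES = 12;       // per-field files per segment (from the thread)
    static final int MAX_SEGMENTS = 10;      // up to 10 segments with mergeFactor = 10

    // Worst case: every index has the maximum number of segments open.
    static long openFiles(int indexCount) {
        return (long) (FILES_PER_SEGMENT + FIELD_FILES) * MAX_SEGMENTS * indexCount;
    }

    public static void main(String[] args) {
        int accounts = 100;
        int languages = 10;
        // One index per (account, language) pair: 1,000 indices.
        System.out.println("index per account and language: "
                + openFiles(accounts * languages));
        // One index per language, account number stored in the record.
        System.out.println("index per language only:        "
                + openFiles(languages));
    }
}
```

With these assumptions the per-account layout comes out at 190,000 files
(the "roughly 200,000" below) and the per-language layout at 1,900. The
sketch ignores any extra per-field file that an added account field might
introduce.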


----- Original Message -----
From: "Alex Murzaku" <murzaku@yahoo.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Sunday, September 15, 2002 6:18 AM
Subject: Re: Abusing lucene and way too many files open


> > So after roughly crunching some numbers, with a mergeFactor set to
> > the default of 10:
> >
> > (7 files per segment + 12 for fields) * (up to 10 segments) * 100
> > accounts * 10 languages
> >
> > = 200,000 open files at once. Ouch.
>
> I have read accounts of people in this list dealing with data sets
> larger than what you describe. There is one thing that you said you
> were considering: why not include both account and language in the
> record. Unless planning to distribute them over several machines, I
> wouldn't artificially create 100*10 indices. In your case, when
> indexing, I would have instead 14 fields * 7 files * 10 segments open
> files... I guess you already have resolved the problem of identifying
> the language and you have built the corresponding analyzers which you
> call when indexing or querying in a given language. Just a thought.
>
>
> =====
> __________________________________
> alex@lissus.com -- http://www.lissus.com


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>