Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <6e3ae6310609281525v11c4a688pec29b59d4d227ea4@mail.gmail.com>
Date: Thu, 28 Sep 2006 15:25:44 -0700
From: "Chris Lu" <chris.lu@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Indexing large index with Lucene
In-Reply-To: <359a92830609280919o1ec125f6ud7d577c83d138f16@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <451BD77E.50800@hauk-sasko.de>
	 <359a92830609280919o1ec125f6ud7d577c83d138f16@mail.gmail.com>

I like the approach in your second point, but I have doubts about the
first point.

For a production-level index, which is usually pretty big, frequently
closing and reopening the searcher may not be fast enough, especially
when you want to cache sorting, because the sort cache has to be
rebuilt for every new searcher. It's better to keep the searchers
open. But while the indexing process is running, the files are
changing. The searcher's segment information becomes outdated, and
read EOF exceptions will happen.

So for a big index, it's better to keep two copies of the index, one
for searching and one for indexing, and to hot-swap them when indexing
is done. This is what we did in DBSight. No more read EOF exceptions
or corrupted indexes.
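
A minimal sketch of the hot-swap idea in plain Java, not DBSight's
actual code. HotSwapHolder and FakeSearcher are hypothetical names;
FakeSearcher only stands in for a real Lucene searcher opened on one
copy of the index. Searches share a read lock, and the swap takes the
write lock, so the old searcher is closed only after in-flight
searches have finished.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Small demo: search copy A, swap in copy B, search again. */
final class HotSwapDemo {
    public static void main(String[] args) {
        FakeSearcher a = new FakeSearcher("index-a");
        HotSwapHolder<FakeSearcher> holder = new HotSwapHolder<>(a);
        String before = holder.search(s -> s.find("lucene"));
        System.out.println(before);

        // Indexing has finished on the second copy; publish it.
        holder.swap(new FakeSearcher("index-b"));
        String after = holder.search(s -> s.find("lucene"));
        System.out.println(after);
        System.out.println("old searcher closed: " + a.closed);
    }
}

/** Holds the live searcher; an indexing job swaps in a fresh one. */
final class HotSwapHolder<S extends Closeable> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private S live; // guarded by lock

    HotSwapHolder(S initial) {
        this.live = Objects.requireNonNull(initial);
    }

    /** Runs one query against whichever searcher is live right now. */
    <R> R search(Function<S, R> query) {
        lock.readLock().lock();
        try {
            return query.apply(live);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Publishes the fresh searcher, then closes the one it replaced. */
    void swap(S fresh) {
        Objects.requireNonNull(fresh);
        S old;
        // The write lock is granted only once no search holds the read lock.
        lock.writeLock().lock();
        try {
            old = live;
            live = fresh;
        } finally {
            lock.writeLock().unlock();
        }
        try {
            old.close(); // no search can still be using the old searcher
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Hypothetical stand-in for a searcher bound to one index directory. */
final class FakeSearcher implements Closeable {
    final String indexDir;
    volatile boolean closed;

    FakeSearcher(String indexDir) {
        this.indexDir = indexDir;
    }

    String find(String term) {
        if (closed) {
            throw new IllegalStateException("searcher closed: " + indexDir);
        }
        return term + "@" + indexDir;
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

In a real setup the indexing job would write to the second copy, open
a new searcher on it (warming the sort cache first if needed), and only
then call swap, so searches never touch files that are being changed.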

Chris Lu
---------------------------
Full-Text Search on Any Applications/Databases
http://www.dbsight.net

On 9/28/06, Erick Erickson <erickerickson@gmail.com> wrote:
> Two things come to mind...
>
> First, you can freely write to an index while searching it; the search
> is always available. I'm pretty sure this includes deleting/re-adding
> documents. However, you won't be able to search on the changes in your
> index until you close/reopen the *searcher*.
>
> Second, depending on how quickly you need updates, you could always make
> a *copy* of your index, update that and then move it back to where your
> searcher looks for it, sort of a batch process really. It all depends
> upon how quickly you require seeing the changes.
>
> Hope this helps
> Erick
>
> On 9/28/06, Eric Louvard <eric.louvard@hauk-sasko.de> wrote:
> >
> > I have been using Lucene for several years. We have had to index
> > more and more documents.
> >
> > I'm now trying to optimise the indexing process for more than
> > 1,000,000 documents, and I can see that performance decreases as the
> > index grows.
> > I would like to know whether someone has already studied this case.
> >
> > It's an interactively maintained index, and the first indexing run is
> > my biggest problem.
> >
> > - A document contains several attributes.
> > - I can't block the index during the indexing process (the search
> > must always be available).
> > - I need to delete the older version of a document if I get a newer
> > one.
> >
> > Thank you for telling me about your personal experience.
> >
> > Éric Louvard.
> >
> > --
> > Kind regards
> >
> > i. A. Éric Louvard
> > HAUK & SASKO Ingenieurgesellschaft mbH
> > Zettachring 2
> > D-70567 Stuttgart
> >
> > Phone: +49 7 11 7 25 89 - 19
> > Fax: +49 7 11 7 25 89 - 50
> > E-Mail: eric.louvard@hauk-sasko.de
> > www: www.hauk-sasko.de
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org