lucene-java-user mailing list archives

From "John Wang" <john.w...@gmail.com>
Subject Re: realtime indexing
Date Fri, 16 Nov 2007 12:05:38 GMT
Thanks Kay.

I am doing exactly what you are saying.

Just to elaborate:

Whatever is submitted to the RAM index is always the latest version. Any
delete (an update is a delete + an add) submitted to any of the
RAM indexes is recorded by uid, and that record is discarded when the
RAM index is discarded.
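
As an illustration only (not the actual implementation described here), the bookkeeping above could be sketched in plain Java as follows. The class and method names are hypothetical, and a HashMap stands in for the real Lucene RAM index:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for one RAM index together with its delete list. */
class RamIndexSketch {
    // Latest version of each document, keyed by uid (stands in for the RAM index).
    private final Map<Integer, String> latestByUid = new HashMap<>();
    // Uids whose older copies in the disk index must be hidden from search results.
    private final Set<Integer> deletedUids = new HashSet<>();

    /** Adds a document; a later add for the same uid replaces the earlier one. */
    void add(int uid, String doc) {
        latestByUid.put(uid, doc);
    }

    /** Removes the document from this RAM index and records the uid in the delete list. */
    void delete(int uid) {
        latestByUid.remove(uid);
        deletedUids.add(uid);
    }

    /** An update is a delete followed by an add. */
    void update(int uid, String doc) {
        delete(uid);
        add(uid, doc);
    }

    String get(int uid) {
        return latestByUid.get(uid);
    }

    /** The delete list handed to the searcher over the disk index. */
    Set<Integer> deleteList() {
        return deletedUids;
    }
}
```

Because the delete list lives inside the same object as the RAM index, dropping the object discards both together.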

That "delete list" is passed on to the searcher handling the disk
index. In the HitCollector, we do a quick lookup of the uid for a given
docid, check whether that uid is in the delete list, and discard the
hit if it is. (In practice, you can write your own searcher
implementation and check before score is called, to avoid unnecessary
scoring computation if you expect the delete list to be large.)

For us, we've implemented a way (by hacking into the Lucene guts) to
look up a uid very fast (it amounts to an array lookup), and
then the check is just an integer hash lookup (our uid is an integer).
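
A minimal sketch of the check itself, under stated assumptions: the docid-to-uid mapping is already available as an int array (how it is extracted from Lucene is not shown here), and the delete list is a set of integer uids. The class DeletedUidFilter and its methods are hypothetical and are not part of the Lucene API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical filter for hits coming from the disk index. */
class DeletedUidFilter {
    // Index is the docid in the disk index, value is the application uid.
    private final int[] docIdToUid;
    // Uids recorded as deleted (or updated) in the RAM indexes.
    private final Set<Integer> deletedUids;

    DeletedUidFilter(int[] docIdToUid, Set<Integer> deletedUids) {
        this.docIdToUid = docIdToUid;
        this.deletedUids = deletedUids;
    }

    /** One array lookup plus one hash lookup per hit. */
    boolean accept(int docId) {
        return !deletedUids.contains(docIdToUid[docId]);
    }

    /** Keeps only the hits whose uid is not in the delete list. */
    List<Integer> filter(int[] hitDocIds) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : hitDocIds) {
            if (accept(docId)) {
                kept.add(docId);
            }
        }
        return kept;
    }
}
```

In a real searcher, a check like accept(docId) would be called from the hit collector, or before scoring when the delete list is expected to be large.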

(I started a thread on the dev list on how to quickly look up the primary
id (uid) given a Lucene doc id.)


Hope this helps.

-John

On Nov 16, 2007 2:59 AM, Antoine Baudoux <ab@taktik.be> wrote:
>         Hi,
>
>         I'm trying to implement a similar solution.
>
>
>         Could you be more precise on how you handle duplicates, as well as
> document deletion?
>
>
>         Thx,
>
>
> Antoine
>
>
> On Nov 16, 2007, at 7:44 AM, John Wang wrote:
>
> > Hi:
> >
> >    It was interesting hearing about the need for real time indexing
> > at the BirdsOfAFeather round table. We also needed to solve this
> > problem. We took this approach:
> >
> > A large disk index that indexes in batch, e.g. sleeps for some time
> > while requests queue up, then wakes up and indexes them.
> > While the large disk index is sleeping, the same requests are also
> > added to a RAM index, and when the disk indexer is working, requests
> > received are added to another RAM index.
> >
> > When new disk index is published, the first ram index points to the
> > secondary ram index, and the secondary ram index is flushed.
> >
> > we keep 1 index reader open for the disk index, and create new
> > indexReaders for the ram indexes per request (it seems to be ok
> > because the ram indexes are small)
> >
> > We use MultiSearcher across these readers.
> >
> > duplicates are also handled with our scheme.
> >
> > I am curious to see if anyone else is trying this. It would be
> > interesting to hear comments from the experts.
> >
> > Thanks
> >
> > -John
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >