lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Future projects
Date Thu, 02 Apr 2009 08:40:53 GMT
On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen
<jason.rutherglen@gmail.com> wrote:
> Now that LUCENE-1516 is close to being committed perhaps we can
> figure out the priority of other issues:
>
> 1. Searchable IndexWriter RAM buffer

I think first priority is to get a good assessment of the performance
of the current implementation (from LUCENE-1516).

My initial tests are very promising: with a writer updating (replacing
random docs) at 50 docs/second on a full (3.2 M) Wikipedia index, I
was able to reopen the reader once per second and do a large (>
500K results) search that sorts by date.  The reopen time was
typically ~40 msec, and search time typically ~35 msec (though there
were random spikes up to ~340 msec).  Note, though, that these results
were on an SSD (Intel X25M 160 GB).

We need more datapoints of the current approach, but this looks likely
to be good enough for starters.  And since we can get it into 2.9,
hopefully it'll get some early usage and people will report back to
help us assess whether further performance improvements are necessary.

If they do turn out to be necessary, I think before your step 1, we
should write small segments into a RAMDirectory instead of the "real"
directory.  That's simpler than truly searching IndexWriter's
in-memory postings data.

> 2. Finish up benchmarking and perhaps implement passing
> filters to the SegmentReader level

What is "passing filters to the SegmentReader level"?  EG as of
LUCENE-1483, we now ask a Filter for its DocIdSet once per
SegmentReader.
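
To make "once per SegmentReader" concrete, here is a rough toy model.
It is not Lucene's API: ToySegment, ToyFilter and
PerSegmentFilterDemo are hypothetical names made up for this sketch.
The filter is asked once per segment and answers in segment-local
docIDs; the caller adds the segment's docBase to get index-wide
docIDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy model only: these types are hypothetical stand-ins, not Lucene classes.
final class ToySegment {
    final int[] years; // one value per segment-local docID

    ToySegment(int... years) { this.years = years; }

    int maxDoc() { return years.length; }
}

interface ToyFilter {
    // Called once per segment; the returned bits use segment-local docIDs.
    BitSet docIdSet(ToySegment segment);
}

final class PerSegmentFilterDemo {
    // Collects matching docs as index-wide docIDs (docBase + local docID).
    static List<Integer> search(List<ToySegment> segments, ToyFilter filter) {
        List<Integer> hits = new ArrayList<>();
        int docBase = 0;
        for (ToySegment seg : segments) {
            BitSet bits = filter.docIdSet(seg); // one call per segment
            for (int d = bits.nextSetBit(0); d >= 0; d = bits.nextSetBit(d + 1)) {
                hits.add(docBase + d);
            }
            docBase += seg.maxDoc();
        }
        return hits;
    }

    public static void main(String[] args) {
        List<ToySegment> segments = Arrays.asList(
            new ToySegment(2007, 2009, 2008),
            new ToySegment(2009, 2006));
        ToyFilter from2009 = seg -> {
            BitSet bits = new BitSet(seg.maxDoc());
            for (int i = 0; i < seg.maxDoc(); i++) {
                if (seg.years[i] == 2009) bits.set(i);
            }
            return bits;
        };
        System.out.println(search(segments, from2009)); // prints [1, 3]
    }
}
```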

> 3. Deleting by doc id using IndexWriter

We need a clean approach for the "docIDs suddenly shift when merge is
committed" problem for this...

Thinking more on this... I think one possible solution may be to
somehow expose IndexWriter's internal docID remapping code.
IndexWriter does delete by docID internally, and whenever a merge is
committed we stop-the-world (sync on IW) and go remap those docIDs.
If we somehow allowed the user to register a callback that we could call
when this remapping occurs, then the user's code could carry the docIDs
without them becoming stale.  Or maybe we could make a class
"PendingDocIDs", which you'd ask the reader to give you, that holds
docIDs and remaps them after each merge.  The problem is, IW
internally always logically switches to the current reader for any
further docID deletion, but the user's code may continue to use an old
reader.  So simply exposing this remapping won't fix it... we'd need
to somehow track the genealogy (quite a bit more complex).
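
A minimal sketch of the callback / "PendingDocIDs" idea, under loud
assumptions: MergeListener, ToyWriter and this PendingDocIDs class
are all hypothetical (nothing like them exists in IndexWriter today),
and the "merge" is modeled as compacting away deleted docs across one
flat docID space.  It only shows the remapping step; it does not
address the old-reader genealogy problem described above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Hypothetical callback: none of these types exist in Lucene.
interface MergeListener {
    // remap returns the new docID for an old one, or -1 if that doc was merged away.
    void onMergeCommitted(IntUnaryOperator remap);
}

// Holds docIDs on the user's behalf and rewrites them whenever a merge commits.
final class PendingDocIDs implements MergeListener {
    private final List<Integer> docIDs = new ArrayList<>();

    synchronized void add(int docID) { docIDs.add(docID); }

    synchronized List<Integer> current() { return new ArrayList<>(docIDs); }

    @Override
    public synchronized void onMergeCommitted(IntUnaryOperator remap) {
        List<Integer> remapped = new ArrayList<>();
        for (int old : docIDs) {
            int now = remap.applyAsInt(old);
            if (now >= 0) remapped.add(now); // drop docs that no longer exist
        }
        docIDs.clear();
        docIDs.addAll(remapped);
    }
}

// Toy stand-in for the writer: a merge compacts away deleted docs.
final class ToyWriter {
    private final List<MergeListener> listeners = new ArrayList<>();

    synchronized void register(MergeListener listener) { listeners.add(listener); }

    // "Stop the world" (sync on the writer), build the old->new map, notify listeners.
    synchronized void commitMerge(BitSet deleted, int maxDoc) {
        final int[] map = new int[maxDoc];
        int next = 0;
        for (int old = 0; old < maxDoc; old++) {
            map[old] = deleted.get(old) ? -1 : next++;
        }
        for (MergeListener l : listeners) {
            l.onMergeCommitted(old -> map[old]);
        }
    }
}

final class DocIdRemapDemo {
    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        PendingDocIDs pending = new PendingDocIDs();
        writer.register(pending);
        pending.add(2);
        pending.add(3);
        pending.add(5);

        BitSet deleted = new BitSet();
        deleted.set(1);
        deleted.set(3);
        writer.commitMerge(deleted, 6);

        System.out.println(pending.current()); // prints [1, 3]
    }
}
```

In this toy run, docs 1 and 3 are deleted, so old docID 2 becomes 1,
old docID 5 becomes 3, and the held docID 3 is dropped.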

> With 1) I'm interested in how we will lock a section of the
> bytes for use by a given reader? We would not actually lock
> them, but we need to set aside the bytes such that for example
> if the postings grows, TermDocs iteration does not progress to
> beyond its limits. Are there any modifications that are needed
> of the RAM buffer format? How would the term table be stored? We
> would not be using the current hash method?

I think the realtime reader'd just store the maxDocID it's allowed to
search, and we would likely keep using the RAM format now used.
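
Roughly, the maxDocID idea could look like the sketch below.  This is
a toy, not IndexWriter's RAM format: ToyRamBuffer and
ToyRealtimeReader are hypothetical names, and the "postings" are just
an append-only list of docIDs for a single term.  The reader records
the highest docID that existed when it was opened and never iterates
past it, even as the buffer keeps growing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: an append-only buffer for one term.
final class ToyRamBuffer {
    private final List<Integer> postings = new ArrayList<>(); // ascending docIDs
    private int nextDocID = 0;

    // Adds a document; records it in the postings if it contains the term.
    synchronized int addDoc(boolean containsTerm) {
        int id = nextDocID++;
        if (containsTerm) postings.add(id);
        return id;
    }

    // A new reader may see every doc added so far, and nothing after that.
    synchronized ToyRealtimeReader openReader() {
        return new ToyRealtimeReader(this, nextDocID - 1);
    }

    // Iterates the shared postings but stops at the caller's limit.
    synchronized List<Integer> postingsUpTo(int maxDocID) {
        List<Integer> out = new ArrayList<>();
        for (int d : postings) {
            if (d > maxDocID) break;
            out.add(d);
        }
        return out;
    }
}

final class ToyRealtimeReader {
    private final ToyRamBuffer buffer;
    private final int maxDocID; // fixed at open time

    ToyRealtimeReader(ToyRamBuffer buffer, int maxDocID) {
        this.buffer = buffer;
        this.maxDocID = maxDocID;
    }

    List<Integer> termDocs() { return buffer.postingsUpTo(maxDocID); }
}

final class MaxDocSnapshotDemo {
    public static void main(String[] args) {
        ToyRamBuffer buffer = new ToyRamBuffer();
        buffer.addDoc(true);   // docID 0
        buffer.addDoc(false);  // docID 1
        buffer.addDoc(true);   // docID 2
        ToyRealtimeReader older = buffer.openReader();

        buffer.addDoc(true);   // docID 3, added after "older" was opened
        ToyRealtimeReader newer = buffer.openReader();

        System.out.println(older.termDocs()); // prints [0, 2]
        System.out.println(newer.termDocs()); // prints [0, 2, 3]
    }
}
```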

Mike


