On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen wrote:
> Now that LUCENE-1516 is close to being committed perhaps we can
> figure out the priority of other issues:
> 1. Searchable IndexWriter RAM buffer

I think first priority is to get a good assessment of the performance
of the current implementation (from LUCENE-1516).
My initial tests are very promising: with a writer updating (replacing
random docs) at 50 docs/second on a full (3.2 M) Wikipedia index, I
was able to reopen the reader once per second and do a large (>
500K results) search that sorts by date. The reopen time was
typically ~40 msec, and search time typically ~35 msec (though there
were random spikes up to ~340 msec). Note, though, that these results
were on an SSD (Intel X25M 160 GB).
We need more datapoints of the current approach, but this looks likely
to be good enough for starters. And since we can get it into 2.9,
hopefully it'll get some early usage and people will report back to
help us assess whether further performance improvements are necessary.
If they do turn out to be necessary, I think before your step 1, we
should write small segments into a RAMDirectory instead of the "real"
directory. That's simpler than truly searching IndexWriter's
in-memory postings data.

> 2. Finish up benchmarking and perhaps implement passing
> filters to the SegmentReader level

What is "passing filters to the SegmentReader level"? EG as of
LUCENE-1483, we now ask a Filter for its DocIdSet once per

> 3. Deleting by doc id using IndexWriter

We need a clean approach for the "docIDs suddenly shift when merge is
committed" problem for this...
Thinking more on this... I think one possible solution may be to
somehow expose IndexWriter's internal docID remapping code.
IndexWriter does delete by docID internally, and whenever a merge is
committed we stop-the-world (sync on IW) and go remap those docIDs.
If we somehow allowed the user to register a callback that we could call
when this remapping occurs, then the user's code could carry the docIDs
without them becoming stale. Or maybe we could make a class
"PendingDocIDs", which you'd ask the reader to give you, that holds
docIDs and remaps them after each merge. The problem is, IW
internally always logically switches to the current reader for any
further docID deletion, but the user's code may continue to use an old
reader. So simply exposing this remapping won't fix it... we'd need
to somehow track the genealogy (quite a bit more complex).
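
To make the remapping idea concrete, here is a minimal sketch in plain
Java. It does not use Lucene, and every name in it (MergeRemapListener,
PendingDocIDs, buildDocMap) is hypothetical, not an existing API. It
assumes a merge simply compacts away deleted docs, and that all held
docIDs are relative to the reader that was current when the merge
committed. It does not address the old-reader/genealogy problem above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class PendingDocIDsSketch {

    /** Hypothetical callback, invoked when a merge commits. Not a Lucene interface. */
    interface MergeRemapListener {
        void onMergeCommitted(int[] oldToNew);
    }

    /** Builds an old->new docID map for a merge that drops deleted docs; -1 means "gone". */
    static int[] buildDocMap(int maxDoc, BitSet deleted) {
        int[] oldToNew = new int[maxDoc];
        int next = 0;
        for (int oldID = 0; oldID < maxDoc; oldID++) {
            oldToNew[oldID] = deleted.get(oldID) ? -1 : next++;
        }
        return oldToNew;
    }

    /** Hypothetical holder that keeps its docIDs valid across merges. */
    static class PendingDocIDs implements MergeRemapListener {
        private List<Integer> docIDs = new ArrayList<>();

        synchronized void add(int docID) {
            docIDs.add(docID);
        }

        @Override
        public synchronized void onMergeCommitted(int[] oldToNew) {
            List<Integer> remapped = new ArrayList<>();
            for (int oldID : docIDs) {
                int newID = oldToNew[oldID];
                if (newID != -1) {
                    remapped.add(newID); // docs removed by the merge are dropped
                }
            }
            docIDs = remapped;
        }

        synchronized List<Integer> get() {
            return new ArrayList<>(docIDs);
        }
    }

    public static void main(String[] args) {
        PendingDocIDs pending = new PendingDocIDs();
        pending.add(2);
        pending.add(4);
        pending.add(5);

        BitSet deleted = new BitSet();
        deleted.set(1);
        deleted.set(4);

        // A merge over 6 docs commits; docs 1 and 4 are compacted away.
        pending.onMergeCommitted(buildDocMap(6, deleted));
        System.out.println(pending.get()); // prints [1, 3]
    }
}
```

Here doc 2 shifts down to 1, doc 5 shifts down to 3, and doc 4 vanishes
because it was already deleted when the merge ran.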

> With 1) I'm interested in how we will lock a section of the
> bytes for use by a given reader? We would not actually lock
> them, but we need to set aside the bytes such that for example
> if the postings grows, TermDocs iteration does not progress to
> beyond its limits. Are there any modifications that are needed
> of the RAM buffer format? How would the term table be stored? We
> would not be using the current hash method?

I think the realtime reader'd just store the maxDocID it's allowed to
search, and we would likely keep using the RAM format now used.
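
A rough illustration of the maxDocID idea, again in plain Java with no
Lucene classes: RamWriter and SnapshotReader are hypothetical stand-ins,
and the postings are just an append-only list of docIDs for one term.
The reader records maxDoc when it is opened and stops iterating at the
first posting at or beyond that bound, so later appends stay invisible
to it without any locking of the underlying data.

```java
import java.util.ArrayList;
import java.util.List;

class MaxDocBoundedReaderSketch {

    /** Hypothetical stand-in for the writer's RAM buffer: postings for a single term. */
    static class RamWriter {
        private final List<Integer> postings = new ArrayList<>(); // ascending docIDs
        private int maxDoc = 0;

        /** Assigns the next docID; records a posting if the doc contains the term. */
        synchronized void addDocument(boolean containsTerm) {
            if (containsTerm) {
                postings.add(maxDoc);
            }
            maxDoc++;
        }

        /** Opens a reader limited to the docs that exist right now. */
        synchronized SnapshotReader openReader() {
            return new SnapshotReader(this, maxDoc);
        }

        synchronized int postingCount() {
            return postings.size();
        }

        synchronized int postingAt(int i) {
            return postings.get(i);
        }
    }

    /** Reader that never looks past the maxDoc it captured at open time. */
    static class SnapshotReader {
        private final RamWriter writer;
        private final int maxDoc;

        SnapshotReader(RamWriter writer, int maxDoc) {
            this.writer = writer;
            this.maxDoc = maxDoc;
        }

        List<Integer> termDocs() {
            List<Integer> hits = new ArrayList<>();
            for (int i = 0; i < writer.postingCount(); i++) {
                int doc = writer.postingAt(i);
                if (doc >= maxDoc) {
                    break; // postings appended after this reader was opened
                }
                hits.add(doc);
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        RamWriter writer = new RamWriter();
        writer.addDocument(true);   // doc 0
        writer.addDocument(false);  // doc 1
        writer.addDocument(true);   // doc 2

        SnapshotReader reader = writer.openReader(); // bounded at maxDoc = 3

        writer.addDocument(true);   // doc 3, not visible to 'reader'
        writer.addDocument(true);   // doc 4, not visible to 'reader'

        System.out.println(reader.termDocs());              // prints [0, 2]
        System.out.println(writer.openReader().termDocs()); // prints [0, 2, 3, 4]
    }
}
```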