lucene-dev mailing list archives

From Jason Rutherglen
Subject Re: Future projects
Date Fri, 03 Apr 2009 21:42:07 GMT
> I think the realtime reader'd just store the maxDocID it's allowed to
> search, and we would likely keep using the RAM format now used.

Sounds pretty good.  Are there any other gotchas in the design?

On Thu, Apr 2, 2009 at 1:40 AM, Michael McCandless wrote:

> On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen wrote:
> > Now that LUCENE-1516 is close to being committed perhaps we can
> > figure out the priority of other issues:
> >
> > 1. Searchable IndexWriter RAM buffer
> I think first priority is to get a good assessment of the performance
> of the current implementation (from LUCENE-1516).
> My initial tests are very promising: with a writer updating (replacing
> random docs) at 50 docs/second on a full (3.2 M) Wikipedia index, I
> was able to reopen the reader once per second and do a large (>
> 500K results) search that sorts by date.  The reopen time was
> typically ~40 msec, and search time typically ~35 msec (though there
> were random spikes up to ~340 msec).  Though, these results were on an
> SSD (Intel X25M 160 GB).
> We need more datapoints of the current approach, but this looks likely
> to be good enough for starters.  And since we can get it into 2.9,
> hopefully it'll get some early usage and people will report back to
> help us assess whether further performance improvements are necessary.
> If they do turn out to be necessary, I think before your step 1, we
> should write small segments into a RAMDirectory instead of the "real"
> directory.  That's simpler than truly searching IndexWriter's
> in-memory postings data.
> > 2. Finish up benchmarking and perhaps implement passing
> > filters to the SegmentReader level
> What is "passing filters to the SegmentReader level"?  EG as of
> LUCENE-1483, we now ask a Filter for its DocIdSet once per
> SegmentReader.
> > 3. Deleting by doc id using IndexWriter
> We need a clean approach for the "docIDs suddenly shift when merge is
> committed" problem for this...
> Thinking more on this... I think one possible solution may be to
> somehow expose IndexWriter's internal docID remapping code.
> IndexWriter does delete by docID internally, and whenever a merge is
> committed we stop-the-world (sync on IW) and go remap those docIDs.
> If we somehow allowed the user to register a callback that we could call
> when this remapping occurs, then the user's code could carry the docIDs
> without them becoming stale.  Or maybe we could make a class
> "PendingDocIDs", which you'd ask the reader to give you, that holds
> docIDs and remaps them after each merge.  The problem is, IW
> internally always logically switches to the current reader for any
> further docID deletion, but the user's code may continue to use an old
> reader.  So simply exposing this remapping won't fix it... we'd need
> to somehow track the genealogy (quite a bit more complex).
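
The "PendingDocIDs" idea above could be sketched roughly as follows. This is
only an illustration under simplifying assumptions: a single flat docID space,
and a merge that does nothing except compact away deleted docs. The class and
method names are hypothetical and are not Lucene API. It also does not address
the stale-reader (genealogy) problem raised above.

```java
import java.util.Arrays;

/** Hypothetical holder for docIDs that should survive a merge; not a Lucene class. */
final class PendingDocIDs {
    private int[] docIDs; // held docIDs, kept in ascending order

    PendingDocIDs(int... docIDs) {
        this.docIDs = docIDs.clone();
        Arrays.sort(this.docIDs);
    }

    /**
     * Remaps the held docIDs after a merge that compacts away deleted docs.
     * Assumption: deletedSorted lists, in ascending order, the old docIDs the
     * merge dropped. Each surviving docID shifts down by the number of dropped
     * docs before it; held docIDs that were themselves dropped are discarded.
     */
    synchronized void remapAfterMerge(int[] deletedSorted) {
        int[] out = new int[docIDs.length];
        int n = 0;
        int d = 0; // number of dropped docs with an ID below the current held ID
        for (int id : docIDs) {
            while (d < deletedSorted.length && deletedSorted[d] < id) {
                d++;
            }
            if (d < deletedSorted.length && deletedSorted[d] == id) {
                continue; // this doc no longer exists after the merge
            }
            out[n++] = id - d;
        }
        docIDs = Arrays.copyOf(out, n);
    }

    /** Returns a copy of the current (post-remap) docIDs. */
    synchronized int[] current() {
        return docIDs.clone();
    }
}
```

For example, holding docIDs 2, 5 and 9 and then remapping after a merge that
dropped docs 0, 3 and 5 leaves the holder with docIDs 1 and 6.
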
> > With 1) I'm interested in how we will lock a section of the
> > bytes for use by a given reader? We would not actually lock
> > them, but we need to set aside the bytes such that for example
> > if the postings grow, TermDocs iteration does not progress
> > beyond its limits. Are there any modifications that are needed
> > of the RAM buffer format? How would the term table be stored? We
> > would not be using the current hash method?
> I think the realtime reader'd just store the maxDocID it's allowed to
> search, and we would likely keep using the RAM format now used.
> Mike
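
The maxDocID approach can be illustrated with a small stand-alone sketch. It
assumes a simplified append-only buffer holding the postings of a single term;
RamBuffer and SnapshotReader are hypothetical names and do not reflect
IndexWriter's actual RAM format. The bound is stored here as an exclusive
limit (the number of docs present when the reader was opened).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical append-only buffer holding the postings of one term. */
final class RamBuffer {
    private int[] postings = new int[8]; // ascending docIDs that contain the term
    private int count = 0;               // number of valid entries in postings
    private int maxDoc = 0;              // number of docs added so far

    synchronized void addDoc(boolean containsTerm) {
        if (containsTerm) {
            if (count == postings.length) {
                postings = Arrays.copyOf(postings, count * 2);
            }
            postings[count++] = maxDoc;
        }
        maxDoc++;
    }

    /** Opens a reader limited to the docs present right now. */
    synchronized SnapshotReader openReader() {
        return new SnapshotReader(this, maxDoc);
    }

    /** Returns the i-th posting, or Integer.MAX_VALUE when past the end. */
    synchronized int docAt(int i) {
        return i < count ? postings[i] : Integer.MAX_VALUE;
    }
}

/** Reader that never iterates past the docID bound captured at open time. */
final class SnapshotReader {
    private final RamBuffer buffer;
    private final int maxDoc; // exclusive bound on searchable docIDs

    SnapshotReader(RamBuffer buffer, int maxDoc) {
        this.buffer = buffer;
        this.maxDoc = maxDoc;
    }

    List<Integer> matchingDocs() {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; ; i++) {
            int doc = buffer.docAt(i);
            if (doc >= maxDoc) {
                break; // added after this reader was opened, or end of postings
            }
            hits.add(doc);
        }
        return hits;
    }
}
```

Because the buffer only appends, no bytes need to be locked or copied in this
sketch: a reader opened before later adds keeps returning the same hits, while
a reader opened afterwards also sees the newer docs.
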