lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
Date Sat, 13 Mar 2010 15:18:27 GMT


Michael McCandless commented on LUCENE-2312:

bq. For the terms dictionary, perhaps a terms array (this could be a
RawPostingList[], or an array of objects with pointers to a
RawPostingList with some helper methods like getTerm and
compareTo) is kept in sorted order; we then binary search and
insert new RawPostingLists/terms into the array. We could
implement a two-dimensional array, allowing us to make a per-reader
copy of the first dimension of the array. This would maintain
transactional consistency (ie, a reader's array isn't changing
as a term enum is traversing in another thread).

I don't think we can do term insertion into an array -- each insert
shifts O(N) entries, so adding N terms is O(N^2) insertion cost
overall -- we should use a btree instead.
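
To make the cost argument concrete, here is a minimal Java sketch, not
Lucene code. It assumes terms are plain Strings (the real buffer would
hold RawPostingList entries), and the class names SortedTermArray and
TermInsertDemo are hypothetical. java.util.TreeMap (a red-black tree)
stands in for the btree only to show O(log N) ordered insertion. The
array version counts how many entries each insert has to shift.

```java
import java.util.Arrays;
import java.util.TreeMap;

/** Hypothetical sorted-array terms dictionary; illustration only. */
class SortedTermArray {
    private String[] terms = new String[4];
    private int size = 0;
    long shifts = 0; // total number of entries moved by all inserts

    /** Inserts term in sorted position; returns false if already present. */
    boolean insert(String term) {
        int pos = Arrays.binarySearch(terms, 0, size, term); // O(log N) lookup
        if (pos >= 0) {
            return false;
        }
        int ins = -(pos + 1); // insertion point encoded by binarySearch
        if (size == terms.length) {
            terms = Arrays.copyOf(terms, size * 2);
        }
        // O(N) step: everything after the insertion point moves right.
        System.arraycopy(terms, ins, terms, ins + 1, size - ins);
        shifts += size - ins;
        terms[ins] = term;
        size++;
        return true;
    }

    int size() {
        return size;
    }

    String get(int i) {
        return terms[i];
    }
}

class TermInsertDemo {
    public static void main(String[] args) {
        SortedTermArray array = new SortedTermArray();
        TreeMap<String, Integer> tree = new TreeMap<>();
        // Descending order is the worst case for the array: every insert
        // lands at position 0 and shifts all existing entries.
        for (int i = 999; i >= 0; i--) {
            String term = String.format("%04d", i);
            array.insert(term);
            tree.put(term, i); // O(log N), no shifting
        }
        System.out.println("array shifts: " + array.shifts);
        System.out.println("tree first term: " + tree.firstKey());
    }
}
```

The binary search itself is cheap; it is the shifting in
System.arraycopy that grows with the number of terms already present.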

Also, we could store the first docID indexed for each term, too -- this
way we could have an ordered collection of terms that's shared across
several open readers even as changes are still being made, but each
reader skips a given term if its first docID is greater than the
maxDoc it's searching.  That'd give us point-in-time searching even
while we add terms over time...
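
A rough Java sketch of that idea, under stated assumptions: terms are
plain Strings, docIDs are assigned in increasing order, and maxDoc is
treated as exclusive (a reader opened at maxDoc sees docIDs 0 through
maxDoc-1, so a term is skipped when its first docID is >= maxDoc). The
class SharedTermsDict and its methods are hypothetical, and
java.util.concurrent.ConcurrentSkipListMap is just one ordered
structure whose iterators tolerate concurrent inserts; it is not what
IndexWriter uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical shared, ordered terms dictionary; illustration only. */
class SharedTermsDict {
    // term -> first docID in which the term appeared
    private final ConcurrentSkipListMap<String, Integer> firstDocByTerm =
        new ConcurrentSkipListMap<>();

    /** Called by the indexing thread; keeps only the first docID per term. */
    void addTerm(String term, int docID) {
        firstDocByTerm.putIfAbsent(term, docID);
    }

    /**
     * Sorted terms visible to a reader opened at the given maxDoc.
     * Terms first seen at docID >= maxDoc are skipped, so the reader's
     * view does not change as later documents add new terms.
     */
    List<String> termsVisibleTo(int maxDoc) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, Integer> e : firstDocByTerm.entrySet()) {
            if (e.getValue() < maxDoc) {
                visible.add(e.getKey());
            }
        }
        return visible;
    }
}
```

For example, a reader opened at maxDoc 2 keeps returning the same terms
from termsVisibleTo(2) no matter how many terms are added for docIDs 2
and above, while a newer reader with a larger maxDoc sees them all from
the same shared map.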

bq. Also, we have to solve what happens to a reader using a RAM segment that's been flushed.
Perhaps we don't reuse RAM at that point, ie, rely on GC to reclaim once all readers using
that RAM segment have closed.

bq. I don't think we have a choice here?

I think we do have a choice.

EG we could force the reader to cutover to the newly flushed segment
(which should be identical to the RAM segment), eg by making [say] a

Still... we'd probably have to not re-use in that case, since there
can be queries in-flight stepping through the RAM postings, and, we
have no way to accurately detect they are done.  But at least with
this approach we wouldn't tie up RAM indefinitely...
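
A small Java sketch of why cutover frees the RAM eventually but does not
make it safe to reuse right away. Everything here is hypothetical
(PostingsSource, CutoverReader, acquire, cutover); it only models a
reader that swaps what it delegates to after a flush.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for either the RAM segment or the flushed one. */
interface PostingsSource {
    String describe();
}

/** Hypothetical reader that can be switched to the flushed segment. */
class CutoverReader {
    private final AtomicReference<PostingsSource> current;

    CutoverReader(PostingsSource ramSegment) {
        current = new AtomicReference<>(ramSegment);
    }

    /** A query grabs its source once and keeps using it until it finishes. */
    PostingsSource acquire() {
        return current.get();
    }

    /** After a flush, new queries are routed to the flushed segment. */
    void cutover(PostingsSource flushedSegment) {
        current.set(flushedSegment);
    }
}
```

A query that called acquire() before cutover() still holds a reference
to the RAM segment and may be stepping through its postings, so those
buffers cannot be recycled; once the last such query finishes, nothing
references them and GC can reclaim them.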

Or maybe we simply state that the APP must aggressively close NRT
readers with time else memory use grows and grows... but I don't
really like that.  We don't have such a restriction today...

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>                 Key: LUCENE-2312
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 3.0.1
>            Reporter: Jason Rutherglen
>             Fix For: 3.0.2
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max doc ids.
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 

