lucene-dev mailing list archives

From "Jason Rutherglen (JIRA)" <>
Subject [jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
Date Tue, 16 Mar 2010 15:46:27 GMT


Jason Rutherglen commented on LUCENE-2312:

I thought we were moving away from byte block pooling and were
going to try relying on garbage collection? Does a volatile
Object[] publish changes to all threads? Probably not; again,
only the pointer (the array reference) would be volatile, not
the elements.
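
To make the volatile array point concrete: in Java, declaring an array field volatile makes only the reference volatile, so plain element writes get no visibility guarantee of their own. Below is a minimal sketch of one way around that, copy-then-publish. The class `PublishedBytes` and its methods are hypothetical names for illustration, not Lucene code, and the sketch assumes a single writer thread.

```java
import java.util.Arrays;

// Hypothetical holder, not part of Lucene: illustrates copy-then-publish.
final class PublishedBytes {
    // Only the reference is volatile; the elements themselves are plain bytes.
    private volatile byte[] snapshot = new byte[0];

    // Writer-private buffer; other threads never read it directly.
    private byte[] working = new byte[16];
    private int used = 0;

    // Called by the single writer thread.
    void append(byte b) {
        if (used == working.length) {
            working = Arrays.copyOf(working, working.length * 2);
        }
        working[used++] = b;
    }

    // Called by the single writer thread: copy first, then volatile-write the
    // reference. Everything written to the copy happens-before the publish.
    void publish() {
        snapshot = Arrays.copyOf(working, used);
    }

    // Called by any reader thread: the volatile read makes the contents of
    // the published copy visible. Readers must treat the array as read-only.
    byte[] read() {
        return snapshot;
    }
}
```

The cost of this approach is one copy per publish, and each publish leaves an older copy behind for the garbage collector.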

In the case of posting/termdocs iteration, I'm more concerned
that the lastDocID be volatile than with the byte array
containing extra data. Extra docs are OK in the byte array
because we'll simply stop iterating when we've reached the last
doc. Though with our system, we shouldn't even run into this
either. Meaning, a byte array is copied and published, but
perhaps the master byte array is still being written to and the
same byte array (by id or something) is published again? Then
we'd have multiple versions of byte arrays. That could be bad.
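
The lastDocID idea can be sketched as follows. This is an illustration under stated assumptions, not Lucene's actual postings format: `RamPostings` is a hypothetical class, doc IDs are stored as plain ints in a fixed-capacity array rather than as encoded bytes, there is a single writer thread, and doc IDs are added in strictly ascending order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory postings for one term, not Lucene's real structure.
final class RamPostings {
    private final int[] docIDs;           // fixed capacity keeps the sketch small
    private int count = 0;                // writer-private
    private volatile int lastDocID = -1;  // published high-water mark

    RamPostings(int capacity) {
        docIDs = new int[capacity];
    }

    // Single writer thread: store the doc ID without making it visible yet.
    void add(int docID) {
        docIDs[count++] = docID;
    }

    // Single writer thread: the volatile write publishes every array write
    // that preceded it.
    void publish() {
        if (count > 0) {
            lastDocID = docIDs[count - 1];
        }
    }

    // Reader thread: read the limit first, then iterate and stop at it.
    // Entries past the limit may exist in the array but are never visited.
    List<Integer> visibleDocs() {
        List<Integer> out = new ArrayList<>();
        int limit = lastDocID;            // volatile read
        if (limit < 0) {
            return out;                   // nothing published yet
        }
        for (int i = 0; i < docIDs.length; i++) {
            int d = docIDs[i];
            out.add(d);
            if (d == limit) {
                break;                    // reached the last published doc
            }
        }
        return out;
    }
}
```

Because the reader stops at the published limit, extra docs written after the last publish are harmless, which matches the reasoning above.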

Because there is one DW per thread, there's only one document
being indexed at a time. There's no writer concurrency. This
leaves reader concurrency. However after each doc, we *could*
simply flush all bytes related to the doc. Any new docs must
simply start writing to new byte arrays? The problem with this
is, unless the byte arrays are really small, we'll have a lot of
extra data around, well, unless the byte arrays are trimmed
before publication. Or we can simply RW lock (or some other
analogous thing) individual byte arrays, not publish them after
each doc, and only publish them when getReader is called. To
clarify, the RW lock (or flag) would only be per byte array. In
fact, all writing to a byte array could cease on flush, with
new byte arrays allocated for later writes. The published byte
array could point to the next byte array.
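
A rough sketch of the last variant, where writing to a byte array ceases on flush and the published array points to the next one. The names `ByteBlockChain`, `Block`, `flush` and `visibleBytes` are hypothetical, and the sketch assumes a single writer thread and uses a volatile `next` link as the per-array "frozen" flag instead of an RW lock.

```java
// Hypothetical chain of byte blocks, not Lucene code. A block becomes
// readable once its `next` link is set, after which it is never written again.
final class ByteBlockChain {
    static final class Block {
        final byte[] bytes;
        int used;             // written only by the writer, before the freeze
        volatile Block next;  // non-null means frozen and safe to read

        Block(int size) {
            bytes = new byte[size];
        }
    }

    private final int blockSize;
    private final Block head;
    private Block current;    // writer-private

    ByteBlockChain(int blockSize) {
        this.blockSize = blockSize;
        this.head = new Block(blockSize);
        this.current = head;
    }

    // Single writer thread: append a byte, freezing the block when it is full.
    void write(byte b) {
        if (current.used == current.bytes.length) {
            flush();
        }
        current.bytes[current.used++] = b;
    }

    // Single writer thread: called when a reader is requested. All writing to
    // the current block stops and a new block takes over.
    void flush() {
        Block frozen = current;
        Block fresh = new Block(blockSize);
        current = fresh;
        frozen.next = fresh;  // volatile write publishes `bytes` and `used`
    }

    // Reader thread: walk only the frozen blocks and total their bytes.
    int visibleBytes() {
        int total = 0;
        for (Block b = head; b.next != null; b = b.next) {
            total += b.used;
        }
        return total;
    }
}
```

One trade-off: a block frozen early keeps its unused tail allocated, which is the "extra data" concern raised above unless blocks are small or trimmed before publication.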

I think we simply need a way to publish byte arrays to all
threads? Michael B., can you post something of what you have so
we can get an idea of how your system will work (i.e., mainly
what the assumptions are)?

We do need to strive for correctness of data, and perhaps
performance will be slightly impacted (though compared with our
current NRT we'll have an overall win).

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>                 Key: LUCENE-2312
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 3.0.1
>            Reporter: Jason Rutherglen
>            Assignee: Michael Busch
>             Fix For: 3.1
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable.
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing.
> Michael Busch has good suggestions regarding how to handle deletes using max doc ids.
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 

