lucene-dev mailing list archives

From "Shai Erera (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields
Date Thu, 01 Mar 2012 21:36:00 GMT


Shai Erera commented on LUCENE-3837:

Andrzej, this brings back [old memories|]

The core difference in your proposal is that the updates are processed in a separate index,
and that at runtime we use a PQ to match documents and collapse all the updates, right? And
these updates will be reflected in the main index on segment merges, right?

I personally prefer a more integrated solution than one that's based on matching PQs, but
since I have barely done anything with my proposal for 2 years, I guess that your progress is better
than no progress at all.

One comment -- when the updates are collapsed, they may not simply 'replace' what existed
before them. I could see an update to a document which adds a stored field, and therefore
if I call IndexReader.document(i), I'd expect to see that stored field along with all the ones
that existed before it.
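
To make the expectation above concrete, here is a minimal sketch of "collapse without dropping existing fields". It is plain Java, not Lucene code: the class StoredFieldCollapse and the method collapse are hypothetical names, and stored fields are simplified to single-valued name/value pairs, which real Lucene documents are not limited to.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene or of either proposal.
public class StoredFieldCollapse {
    /**
     * Collapses a base document and its stacked updates into one view.
     * Updates are applied oldest first. A field named in an update overrides
     * the same-named field; every other existing field is kept as-is.
     */
    public static Map<String, String> collapse(Map<String, String> base,
                                               List<Map<String, String>> updates) {
        Map<String, String> merged = new LinkedHashMap<>(base);
        for (Map<String, String> update : updates) {
            merged.putAll(update); // adds new fields, replaces updated ones
        }
        return merged;
    }
}
```

With this behaviour, an update that only adds a "price" field to a document that already stores "title" yields a document with both fields.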

At the time I felt that modifying Lucene to add stacked segments was way too complicated, and
the indexing internals kept changing by the day. But now Codecs seem to be very stable, and
code changes on trunk have slowed down, so perhaps it'll be worthwhile taking a second look at that proposal?
(but only if you feel like it)
> A modest proposal for updateable fields
> ---------------------------------------
>                 Key: LUCENE-3837
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Andrzej Bialecki 
> I'd like to propose a simple design for implementing updateable fields in Lucene. This
design has some limitations, so I'm not claiming it will be appropriate for every use case,
and it's obvious it has some performance consequences, but at least it's a start...
> This proposal uses a concept of "overlays" or "stacked updates", where the original data
is not removed but instead it's overlaid with the new data. I propose to reuse as much of
the existing APIs as possible, and represent updates as an IndexReader. Updates to documents
in a specific segment would be collected in an "overlay" index specific to that segment, i.e.
there would be as many overlay indexes as there are segments in the primary index. 
> A field update would be represented as a new document in the overlay index. The document
would consist of just the updated fields, plus a field that records the id in the primary
segment of the document affected by the update. These updates would be processed as usual
via secondary IndexWriter-s, as many as there are primary segments, so the same analysis chains
would be used, the same field types, etc.
> On opening a segment with updates the SegmentReader (see also LUCENE-3836) would check
for the presence of the "overlay" index, and if so it would open it first (as an AtomicReader?
or it would open individual codec format readers? perhaps it should load the whole thing into
memory?), and it would construct an in-memory map between the primary's docId-s and the overlay's
docId-s. And finally it would wrap the original format readers with "overlay readers", initialized
also with the id map.
> Now, when consumers of the 4D API would ask for specific data, the "overlay readers"
would first re-map the primary's docId to the overlay's docId, and check whether overlay data
exists for that docId and this type of data (e.g. postings, stored fields, vectors) and return
this data instead of the original. Otherwise they would return the original data.
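
The lookup described in the quoted paragraph above could be sketched roughly as follows. This is an illustration in plain Java under stated assumptions, not Lucene code: OverlayLookup, overlayTargets and value are hypothetical names, each array index stands in for a docId, and a single String per document stands in for "this type of data".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not an actual Lucene reader.
public class OverlayLookup {
    private final Map<Integer, Integer> primaryToOverlay = new HashMap<>();
    private final String[] primaryValues;
    private final String[] overlayValues;

    /**
     * @param overlayTargets overlayTargets[i] is the primary docId recorded
     *                       in overlay document i
     */
    public OverlayLookup(String[] primaryValues, String[] overlayValues, int[] overlayTargets) {
        this.primaryValues = primaryValues;
        this.overlayValues = overlayValues;
        // Build the in-memory id map; a later overlay doc for the same
        // primary docId overwrites the earlier mapping.
        for (int overlayDoc = 0; overlayDoc < overlayTargets.length; overlayDoc++) {
            primaryToOverlay.put(overlayTargets[overlayDoc], overlayDoc);
        }
    }

    /** Returns overlay data when it exists for this docId, else the original data. */
    public String value(int primaryDoc) {
        Integer overlayDoc = primaryToOverlay.get(primaryDoc);
        if (overlayDoc != null && overlayValues[overlayDoc] != null) {
            return overlayValues[overlayDoc];
        }
        return primaryValues[primaryDoc];
    }
}
```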
> One obvious performance issue with this approach is that the sequential access to primary
data would translate into random access to the overlay data. This could be solved by sorting
the overlay index so that at least the overlay ids increase monotonically as primary ids do.
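
One way to picture the sorting step in the quoted paragraph above, as a sketch only: compute an ordering of the overlay documents by the primary docId each one targets, so that a sequential pass over the primary never has to jump backwards in the overlay. SortedOverlay and sortByPrimary are hypothetical names, and how such an ordering would actually be applied to an on-disk index is left open here.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; not part of Lucene.
public class SortedOverlay {
    /**
     * Returns the overlay docIds reordered so that the primary docIds they
     * target are non-decreasing. The sort is stable, so several updates to
     * the same primary document keep their original relative order.
     *
     * @param overlayTargets overlayTargets[i] is the primary docId recorded
     *                       in overlay document i
     */
    public static Integer[] sortByPrimary(int[] overlayTargets) {
        Integer[] order = new Integer[overlayTargets.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingInt(i -> overlayTargets[i]));
        return order;
    }
}
```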
> Updates to the primary index would be handled as usual, i.e. segment merges (since the
segments with updates would pretend to have no overlays) would just work as usual, only the
overlay index would have to be deleted once the primary segment is deleted after merge.
> Updates to the existing documents that already had some fields updated would be again
handled as usual, only underneath they would open an IndexWriter on the overlay index for
a specific segment.
> That's the broad idea. Feel free to pipe in - I started some coding at the codec level
but got stuck using the approach in LUCENE-3836. The approach that uses a modified SegmentReader
seems more promising.
