lucene-dev mailing list archives

From Jason Rutherglen <>
Subject Re: Future projects
Date Thu, 02 Apr 2009 20:43:11 GMT
> What does Bobo use the cached bitsets for?

Bobo is a faceting engine that uses custom field caches, and sometimes cached
bitsets, rather than relying exclusively on bitsets to calculate facets.  It
is useful where many facets (50+) need to be calculated and bitset caching,
loading and intersection would be too costly.  Instead it iterates over
in-memory custom field caches while collecting hits.  Because we're also doing
realtime search, making the loading more efficient via in-memory field
cache merging is interesting.
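
To make the contrast with bitset intersection concrete, here is a minimal
sketch of counting one facet by reading an in-memory per-document array while
hits are collected.  It is not Bobo's code: the class name, the ordinal
array layout and the "color" field are all assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch: facet counting from an in-memory field cache.
class FacetCountSketch {

    /**
     * ordByDoc[doc] holds the ordinal of the facet value for that doc
     * (an assumed layout, similar in spirit to a string-index field cache).
     * Each hit costs one array read and one increment, with no per-value
     * bitset to load or intersect.
     */
    static int[] countFacets(int[] ordByDoc, int numValues, int[] hitDocIds) {
        int[] counts = new int[numValues];
        for (int doc : hitDocIds) {
            counts[ordByDoc[doc]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed "color" field: 0 = red, 1 = green, 2 = blue.
        int[] colorOrdByDoc = {0, 1, 1, 2, 0, 1};
        int[] hits = {1, 2, 4, 5};
        System.out.println(Arrays.toString(countFacets(colorOrdByDoc, 3, hits)));
    }
}
```

With 50+ facets the same loop would run over one such array per facet, which
is why reloading those arrays from disk after every merge hurts.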

True, we do the in-memory merging with deleted docs; norms would be good as
well.  As a first step, how should we expose the segments a segment has
originated from?  I would like to get this implemented for 2.9 as a building
block that perhaps we can build other things on.  Column stride fields still
require some encoding, and merging field caches in RAM would presumably be
> Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where
> each "generation" is a renumbering event).

Couldn't each SegmentReader keep a docmap and the names of the segments it
originated from?  However, the name is not enough of a unique key, as the
deleted docs change.  It seems like we need a unique id for each
segment reader, where an id is assigned to cloned readers too (which normally
have the same segment name as the original SR).  The ID could be a stamp
(perhaps only given to read-only readers?).  That way the
SegmentReader.getMergedFrom method does not need to return the actual
readers, but a docmap and the parent readers' IDs.  It would be assumed the
user is holding the readers somewhere.  Perhaps all this can be
achieved with a callback in IW, and all this logic could be kept somewhat
internal to Lucene?
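
A rough sketch of what I mean, purely illustrative: none of these names exist
in Lucene today.  MergedFrom stands in for whatever getMergedFrom might
return (parent reader IDs plus one docmap per parent), and -1 is an assumed
marker for a deleted doc.

```java
import java.util.Arrays;

// Hypothetical sketch: docmaps keyed by reader ID, composed across generations.
class DocMapSketch {

    /** Assumed return shape for a getMergedFrom-style call. */
    static final class MergedFrom {
        final long[] parentReaderIds; // unique stamps, not segment names
        final int[][] docMaps;        // docMaps[p][oldDoc] = newDoc, or -1 if deleted

        MergedFrom(long[] parentReaderIds, int[][] docMaps) {
            this.parentReaderIds = parentReaderIds;
            this.docMaps = docMaps;
        }
    }

    /** Chains gen X -> X+1 with gen X+1 -> X+2 to get gen X -> X+2. */
    static int[] compose(int[] first, int[] second) {
        int[] out = new int[first.length];
        for (int i = 0; i < first.length; i++) {
            int mid = first[i];
            out[i] = (mid < 0) ? -1 : second[mid];
        }
        return out;
    }

    /** Merges per-doc cache values in RAM, skipping deleted docs. */
    static int[] mergeCache(int[][] parentValues, MergedFrom info, int mergedMaxDoc) {
        int[] merged = new int[mergedMaxDoc];
        for (int p = 0; p < parentValues.length; p++) {
            int[] map = info.docMaps[p];
            for (int oldDoc = 0; oldDoc < map.length; oldDoc++) {
                if (map[oldDoc] >= 0) {
                    merged[map[oldDoc]] = parentValues[p][oldDoc];
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two renumbering events seen by one old reader.
        int[] xToX1 = {0, -1, 1, 2};
        int[] x1ToX2 = {-1, 0, 1};
        System.out.println(Arrays.toString(compose(xToX1, x1ToX2)));

        // Two parent readers (IDs 7 and 8) merged into one 4-doc segment.
        MergedFrom info = new MergedFrom(
            new long[] {7L, 8L},
            new int[][] {{0, -1, 1}, {2, 3}});
        int[][] caches = {{10, 11, 12}, {20, 21}};
        System.out.println(Arrays.toString(mergeCache(caches, info, 4)));
    }
}
```

The user would look up its own cached arrays by parent reader ID, so Lucene
only has to hand back IDs and docmaps, never the readers themselves.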

On Thu, Apr 2, 2009 at 12:59 PM, Michael McCandless <> wrote:

> On Thu, Apr 2, 2009 at 2:07 PM, Jason Rutherglen
> <> wrote:
> > I'm interested in merging cached bitsets and field caches.  While this may
> > be something related to LUCENE-831, in Bobo there are custom field caches
> > which we want to merge in RAM (rather than reload from the reader using
> > termenum + termdocs).  This could somehow lead to delete by doc id.
>
> What does Bobo use the cached bitsets for?
>
> Merging FieldCache in RAM is also interesting for near-realtime
> search, once we have column stride fields.  Ie, they should behave
> like deleted docs: there's no reason to go through disk when merging
> them -- just carry them straight to the merged reader.  Only on commit
> do they need to go to disk.  Hmm in fact we could do this today, too,
> eg with norms as a future optimization if needed.  And that
> optimization applies to flushing as well (ie, when flushing a new
> segment, since we know we will open a reader, we could NOT flush the
> norms, and instead put them into the reader, and only on eventual
> commit, flush to disk).
>
> > Tracking the genealogy of segments is something we can provide as a callback
> > from IndexWriter?  Or could we add a method to IndexCommit or SegmentReader
> > that returns the segments it originated from?
>
> Well.... the problem with my idea (callback from IW when docs shift)
> is internally IW always uses the latest reader to get any new docIDs.
> Ie we only have to renumber from gen X to X+1, then from X+1 to X+2
> (where each "generation" is a renumbering event).
>
> But if you have a reader, perhaps oldish by now, we'd need to give you
> a way to map across N generations of docID shifts (which'd require the
> genealogy tracking).
>
> Alas I think it will quickly get hairy.
>
> Mike