> What does Bobo use the cached bitsets for?
Bobo is a faceting engine that uses custom field caches, and sometimes cached bitsets, rather than relying exclusively on bitsets to calculate facets. It is useful where many facets (50+) need to be calculated and bitset caching, loading and intersection would be too costly. Instead it iterates over in-memory custom field caches while collecting hits. Because we're also doing realtime search, making the loading more efficient by merging the in-memory field caches is interesting.
True, we already do the in-memory merging with deleted docs; norms would be good as well. As a first step, how should we expose the segments a segment has originated from? I would like to get this implemented for 2.9 as a building block that perhaps we can build other things on. Column stride fields still require some encoding, and merging field caches in RAM would presumably be faster?
> Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where each "generation" is a renumbering event).
Couldn't each SegmentReader keep a docmap and the names of the segments it originated from? However, the name is not enough of a unique key, as the deleted docs change. It seems like we need a unique ID for each segment reader, where the ID is also assigned to cloned readers (which normally have the same segment name as the original SR). The ID could be a stamp (perhaps only given to read-only readers?). That way the SegmentReader.getMergedFrom method does not need to return the actual readers, just a docmap and the parent readers' IDs. It would be assumed the user is holding the readers somewhere. Perhaps all this can be achieved with a callback in IW, so the logic could be kept somewhat internal to Lucene?
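To make the idea a bit more concrete, here is a rough sketch in plain Java of what such a record might hold. Everything in it is hypothetical: `MergedFrom`, the `long` reader IDs and the docmap layout (`-1` meaning "deleted") are assumptions for illustration, and nothing like this exists in Lucene today.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical record a merged reader could return instead of the parent readers themselves. */
final class MergedFrom {
    /** Parent reader ID -> docmap; docMap[oldDoc] is the docID in the merged segment, or -1 if deleted. */
    private final Map<Long, int[]> docMaps = new LinkedHashMap<>();

    /** Registers one parent reader (identified by its assumed unique ID) and its docmap. */
    void addParent(long readerId, int[] docMap) {
        docMaps.put(readerId, docMap.clone());
    }

    /** IDs of the readers this segment was merged from, in the order they were added. */
    Iterable<Long> parentIds() {
        return docMaps.keySet();
    }

    /** Returns the docID in the merged segment, or -1 if the doc was deleted or the parent is unknown. */
    int remap(long readerId, int oldDoc) {
        int[] map = docMaps.get(readerId);
        if (map == null || oldDoc < 0 || oldDoc >= map.length) {
            return -1;
        }
        return map[oldDoc];
    }
}
```

The caller would look up its own reader by ID and use `remap` to carry cached values over, so Lucene never has to hand out the parent readers.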
On Thu, Apr 2, 2009 at 2:07 PM, Jason Rutherglen wrote:
> I'm interested in merging cached bitsets and field caches. While this may
> be something related to LUCENE-831, in Bobo there are custom field caches
> which we want to merge in RAM (rather than reload from the reader using
> termenum + termdocs). This could somehow lead to delete by doc id.
What does Bobo use the cached bitsets for?
Merging FieldCache in RAM is also interesting for near-realtime
search, once we have column stride fields. Ie, they should behave
like deleted docs: there's no reason to go through disk when merging
them -- just carry them straight to the merged reader. Only on commit
do they need to go to disk. Hmm in fact we could do this today, too,
eg with norms as a future optimization if needed. And that
optimization applies to flushing as well (ie, when flushing a new
segment, since we know we will open a reader, we could NOT flush the
norms, and instead put them into the reader, and only on eventual
commit, flush to disk).
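As a minimal sketch of the "carry them straight to the merged reader" idea, assuming the cache is a plain int-per-document array and that a merge concatenates segments in order while dropping deleted docs: the class and method names below are made up for illustration and are not Lucene API.

```java
/** Hypothetical helper: merges per-segment int field caches in RAM, without touching disk. */
final class FieldCacheMerger {
    /**
     * segmentValues[i][doc] is the cached value for doc in source segment i;
     * deleted[i][doc] is true if that doc is deleted and must not be carried over.
     * Returns the cache for the merged segment, indexed by the new (compacted) docID.
     */
    static int[] merge(int[][] segmentValues, boolean[][] deleted) {
        // First pass: count live docs so the merged array can be sized exactly.
        int live = 0;
        for (int i = 0; i < segmentValues.length; i++) {
            for (int doc = 0; doc < segmentValues[i].length; doc++) {
                if (!deleted[i][doc]) {
                    live++;
                }
            }
        }
        // Second pass: copy live values in segment order, assigning new docIDs as we go.
        int[] merged = new int[live];
        int next = 0;
        for (int i = 0; i < segmentValues.length; i++) {
            for (int doc = 0; doc < segmentValues[i].length; doc++) {
                if (!deleted[i][doc]) {
                    merged[next++] = segmentValues[i][doc];
                }
            }
        }
        return merged;
    }
}
```

The same copy loop would apply to norms (byte arrays) or any other per-document array; only the element type changes.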
> Tracking the genealogy of segments is something we can provide as a callback
> from IndexWriter? Or could we add a method to IndexCommit or SegmentReader
> that returns the segments it originated from?
Well.... the problem with my idea (callback from IW when docs shift)
is internally IW always uses the latest reader to get any new docIDs.
Ie we only have to renumber from gen X to X+1, then from X+1 to X+2
(where each "generation" is a renumbering event).
But if you have a reader, perhaps oldish by now, we'd need to give you
a way to map across N generations of docID shifts (which'd require the
Alas I think it will quickly get hairy.
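For what it's worth, the cross-generation mapping itself could be sketched roughly as below, assuming each renumbering event is captured as an int[] docmap with -1 for a dropped doc. `DocMapChain` and its methods are hypothetical; the hairy part is keeping those maps around for old readers, not the arithmetic.

```java
/** Hypothetical helper: maps a docID across several renumbering events ("generations"). */
final class DocMapChain {
    /**
     * generations[g][doc] is the docID after renumbering event g, or -1 if the doc was dropped.
     * Applies the maps in order and returns the final docID, or -1 if the doc did not survive.
     */
    static int remap(int doc, int[][] generations) {
        for (int[] map : generations) {
            if (doc < 0 || doc >= map.length) {
                return -1;
            }
            doc = map[doc];
        }
        return doc;
    }

    /** Collapses two consecutive docmaps (gen X to X+1, then X+1 to X+2) into a single map. */
    static int[] compose(int[] first, int[] second) {
        int[] out = new int[first.length];
        for (int i = 0; i < first.length; i++) {
            int mid = first[i];
            out[i] = (mid < 0 || mid >= second.length) ? -1 : second[mid];
        }
        return out;
    }
}
```

Composing eagerly keeps one map per old reader instead of a chain, at the cost of rewriting that map on every renumbering event.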