> I think I need to understand better why delete by Query isn't
viable in your situation...
Delete by query is a separate problem which I haven't fully
explored yet. Tracking the segment genealogy is really an
interim step for merging field caches before column stride
fields get implemented. Actually CSF cannot be used with Bobo's
field caches anyway, which means we'd need a way to find out
about the segment parents.
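To make the field-cache merging idea concrete, here is a minimal sketch (all names hypothetical, not existing Lucene or Bobo APIs): given the parent segments' cache arrays and their deleted docs, the merged segment's cache is the concatenation of the surviving entries, in the same order segment merging renumbers docIDs.

```java
// Hypothetical sketch: merge parent segments' field-cache arrays into a
// cache for the merged segment, dropping deleted docs. This mirrors the
// docID renumbering that segment merging performs on disk.
public class FieldCacheMerger {
    public static int[] merge(int[][] parentCaches, java.util.BitSet[] parentDeletes) {
        int total = 0;
        for (int i = 0; i < parentCaches.length; i++) {
            total += parentCaches[i].length - parentDeletes[i].cardinality();
        }
        int[] merged = new int[total];
        int upto = 0;
        for (int i = 0; i < parentCaches.length; i++) {
            for (int doc = 0; doc < parentCaches[i].length; doc++) {
                if (!parentDeletes[i].get(doc)) {
                    merged[upto++] = parentCaches[i][doc];
                }
            }
        }
        return merged;
    }
}
```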
> Does it operate at the segment level? Seems like that'd give
> you good enough realtime performance (though merging in RAM will
> definitely be faster).
We need to see how Bobo integrates with LUCENE-1483.
It seems like we've been talking about CSF for 2 years and there
isn't a patch for it? If I had more time I'd take a look. What
is the status of it?
I'll write a patch that implements a callback for segment
merging such that the user can decide what information they want
to record about the merged SRs. (I'm pretty sure there isn't a
way to do this with MergePolicy?)
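Such a callback could look roughly like this (purely a sketch; SegmentMergeListener and everything in it are invented names, not an existing IndexWriter or MergePolicy API):

```java
// Hypothetical listener an IndexWriter could invoke after merging
// segments. "SR" = SegmentReader; reduced here to segment names plus a
// docID map so the sketch stays self-contained.
public interface SegmentMergeListener {
    /**
     * @param sourceSegments names of the segments that were merged away
     * @param mergedSegment  name of the newly created segment
     * @param docMaps        per-source map from old docID to docID in the
     *                       merged segment, or -1 if the doc was deleted
     */
    void onMerge(String[] sourceSegments, String mergedSegment, int[][] docMaps);
}
```

The app would register one of these and record whatever genealogy it needs (for Bobo, enough to remap its field caches).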
On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen
>> What does Bobo use the cached bitsets for?

> Bobo is a faceting engine that uses custom field caches and sometimes cached
> bitsets rather than relying exclusively on bitsets to calculate facets. It
> is useful where many facets (50+) need to be calculated and bitset caching,
> loading and intersection would be too costly. Instead it iterates over in
> memory custom field caches while hit collecting. Because we're also doing
> realtime search, making the loading more efficient via the in memory field
> cache merging is interesting.

OK.
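The collection pattern described in the quote can be sketched like this (hypothetical code, not actual Bobo internals): each hit's docID indexes into an in-memory ordinal array, and a per-value count is bumped during hit collection, so no per-value bitsets are loaded or intersected.

```java
// Hypothetical facet counter in the style described above: one ordinal
// per doc in a field-cache-like array; counts accumulate while hits are
// collected instead of via per-value bitset intersections.
public class FacetCountCollector {
    private final int[] ordinals; // docID -> facet value ordinal
    private final int[] counts;   // ordinal -> hit count

    public FacetCountCollector(int[] ordinals, int numValues) {
        this.ordinals = ordinals;
        this.counts = new int[numValues];
    }

    public void collect(int docID) { // called once per hit
        counts[ordinals[docID]]++;
    }

    public int countFor(int ordinal) {
        return counts[ordinal];
    }
}
```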
Does it operate at the segment level? Seems like that'd give you good
enough realtime performance (though merging in RAM will definitely be
faster).

> True, we do the in memory merging with deleted docs, norms would be good as
> well.

Yes, and eventually column stride fields.

> As a first step how should we expose the segments a segment has
> originated from?

I'm not sure; it's quite messy. Each segment must track what other
segment it got merged to, and must hold a copy of its deletes as of
the time it was merged. And each segment must know what other
segments it got merged with.
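One way to picture the bookkeeping described above (names invented for illustration, not a proposed API): each retired segment keeps a record of where its docs went and what was deleted at merge time.

```java
// Hypothetical per-segment genealogy record: the segment this one was
// merged into, the siblings it was merged with, a snapshot of deletes
// as of the merge, and the old -> new docID map.
public class SegmentGenealogy {
    public final String segmentName;
    public final String mergedInto;               // segment it got merged to
    public final String[] mergedWith;             // other segments in the same merge
    public final java.util.BitSet deletesAtMerge; // deletes as of merge time
    public final int[] docMap;                    // old docID -> new docID, -1 if deleted

    public SegmentGenealogy(String segmentName, String mergedInto,
                            String[] mergedWith,
                            java.util.BitSet deletesAtMerge, int[] docMap) {
        this.segmentName = segmentName;
        this.mergedInto = mergedInto;
        this.mergedWith = mergedWith;
        this.deletesAtMerge = deletesAtMerge;
        this.docMap = docMap;
    }
}
```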
Is this really a serious problem in your realtime search? Eg, from
John's numbers in using payloads to read in the docID -> UID mapping,
it seems like you could make a Query that when given a reader would go
and do the "Approach 2" to perform the deletes (if indeed you are
needing to delete thousands of docs with each update). What sort of
docs/sec rates are you needing to handle?
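In outline (a sketch only; "Approach 2" itself isn't spelled out in this thread, and the UID array would come from John's payload-based docID -> UID loading): once the mapping is in RAM, resolving a batch of UID deletes against a reader is a single linear scan.

```java
// Hypothetical helper: given a per-reader docID -> UID array (e.g.
// loaded from payloads) and a set of UIDs to delete, return the docIDs
// a delete pass would mark deleted on that reader.
public class UidDeleteResolver {
    public static int[] resolve(long[] docToUid, java.util.Set<Long> uidsToDelete) {
        java.util.List<Integer> hits = new java.util.ArrayList<Integer>();
        for (int doc = 0; doc < docToUid.length; doc++) {
            if (uidsToDelete.contains(docToUid[doc])) {
                hits.add(doc);
            }
        }
        int[] result = new int[hits.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = hits.get(i);
        }
        return result;
    }
}
```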
> I would like to get this implemented for 2.9 as a building
> block that perhaps we can write other things on.

I think that's optimistic. It's still at the
hairy-can't-see-a-clean-way-to-do-it phase. Plus I'd like to
understand that all other options have been exhausted first.
Especially once we have column stride fields and they are merged in
RAM, you'll be handed a reader pre-warmed and you can then jump
through those arrays to find docs to delete.
> Column stride fields still
> requires some encoding and merging field caches in ram would presumably be
> faster.

Yes, potentially much faster. There's no sense in writing through to
disk until commit is called.
>> Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where
>> each "generation" is a renumbering event).

> Couldn't each SegmentReader keep a docmap and the names of the segments it
> originated from? However the name is not enough of a unique key, as there's
> the deleted docs that change? It seems like we need a unique id for each
> segment reader, where the id is assigned to cloned readers (which normally
> have the same segment name as the original SR). The ID could be a stamp
> (perhaps only given to readonly readers?). That way the
> SegmentReader.getMergedFrom method does not need to return the actual
> readers, but a docmap and the parent readers' IDs? It would be assumed the
> user would be holding the readers somewhere? Perhaps all this can be
> achieved with a callback in IW, and all this logic could be kept somewhat
> internal to Lucene?

The docMap is a costly way to store it, since it consumes 32 bits per
doc (vs storing a copy of the deleted docs). But then docMap gives you
random access on the map.
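The tradeoff is exactly as stated: an int docMap costs 32 bits per doc while a deleted-docs snapshot costs 1 bit per doc, but only the docMap answers "where did old doc N go?" in O(1). A sketch of deriving one from the other (illustrative code, not Lucene internals):

```java
// Sketch of the two representations compared above: a 32-bit-per-doc
// docMap with O(1) random access can be rebuilt from a 1-bit-per-doc
// deleted-docs snapshot by densely renumbering the surviving docs.
public class DocMaps {
    // Surviving docs are renumbered densely; deleted docs map to -1.
    public static int[] fromDeletes(java.util.BitSet deletes, int maxDoc) {
        int[] docMap = new int[maxDoc]; // 32 bits per doc
        int newDoc = 0;
        for (int oldDoc = 0; oldDoc < maxDoc; oldDoc++) {
            docMap[oldDoc] = deletes.get(oldDoc) ? -1 : newDoc++;
        }
        return docMap;
    }
}
```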
What if prior to merging, or committing merged deletes, there were a
callback to force the app to materialize any privately buffered
deletes? And then the app is not allowed to use those readers for
further deletes? Still kinda messy.
I think I need to understand better why delete by Query isn't viable
in your situation...