lucene-dev mailing list archives

From John Wang <>
Subject Re: Future projects
Date Thu, 02 Apr 2009 22:55:42 GMT
Just to clarify: Approach 1 and Approach 2 are both currently performing
OK for us.

On Thu, Apr 2, 2009 at 2:41 PM, Michael McCandless <> wrote:

> On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen <> wrote:
> >> What does Bobo use the cached bitsets for?
> >
> > Bobo is a faceting engine that uses custom field caches and
> > sometimes cached bitsets, rather than relying exclusively on
> > bitsets, to calculate facets.  It is useful where many facets (50+)
> > need to be calculated and bitset caching, loading, and intersection
> > would be too costly.  Instead it iterates over in-memory custom
> > field caches while hit collecting.  Because we're also doing
> > realtime search, making the loading more efficient via in-memory
> > field cache merging is interesting.
>
> OK.
>
> Does it operate at the segment level?  Seems like that'd give you
> good enough realtime performance (though merging in RAM will
> definitely be faster).
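A minimal sketch of the field-cache-driven facet counting Bobo is
described as doing above (not Bobo's actual code; the "color" field name
and the use of Lucene 2.9's per-segment Collector API are assumptions):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Scorer;

    // Count facet values per hit by reading a per-segment in-memory
    // field cache, rather than loading and intersecting one cached
    // bitset per facet value.
    class FacetCountingCollector extends Collector {
      final java.util.Map<String,Integer> counts =
          new java.util.HashMap<String,Integer>();
      private FieldCache.StringIndex ords;  // per-segment docID -> ordinal

      public void setNextReader(IndexReader reader, int docBase)
          throws IOException {
        // Loaded once per segment and cached; this is the in-memory
        // array the hit collector iterates over.
        ords = FieldCache.DEFAULT.getStringIndex(reader, "color");
      }

      public void collect(int doc) {
        String value = ords.lookup[ords.order[doc]];  // null if absent
        if (value != null) {
          Integer c = counts.get(value);
          counts.put(value, c == null ? 1 : c + 1);
        }
      }

      public void setScorer(Scorer scorer) {}
      public boolean acceptsDocsOutOfOrder() { return true; }
    }

Because setNextReader is called once per segment, only the segments that
changed since the last reopen need their arrays (re)loaded, which is what
makes the in-memory merging above attractive for realtime search.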
> > True, we do the in-memory merging with deleted docs; norms would be
> > good as well.
>
> Yes, and eventually column stride fields.
> > As a first step, how should we expose the segments a segment
> > originated from?
>
> I'm not sure; it's quite messy.  Each segment must track what other
> segment it got merged to, and must hold a copy of its deletes as of
> the time it was merged.  And each segment must know what other
> segments it got merged with.
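To make that bookkeeping concrete, a hypothetical sketch (no such class
exists in Lucene; all names here are invented):

    // Invented names, for illustration only: the per-segment state the
    // paragraph above says a merge would have to track.
    class SegmentMergeRecord {
      String mergedToSegment;            // segment this one was merged into
      java.util.BitSet deletesAtMerge;   // copy of deletes as of the merge
      java.util.List<String> mergedWith; // the other segments in that merge
    }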
> Is this really a serious problem in your realtime search?  Eg, from
> John's numbers on using payloads to read in the docID -> UID mapping,
> it seems like you could make a Query that, when given a reader, would
> go and do "Approach 2" to perform the deletes (if indeed you need to
> delete thousands of docs with each update).  What sort of docs/sec
> rates are you needing to handle?
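Such a delete-by-UID pass might look roughly like this (a sketch only:
"Approach 2" is defined earlier in the thread, and the single-valued
"uid" term per document is an assumption):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Resolve each UID to its docID(s) through the reader, then delete.
    void deleteByUids(IndexReader reader, String[] uids) throws IOException {
      TermDocs td = reader.termDocs();
      try {
        for (int i = 0; i < uids.length; i++) {
          td.seek(new Term("uid", uids[i]));
          while (td.next()) {
            reader.deleteDocument(td.doc());  // needs a writable reader
          }
        }
      } finally {
        td.close();
      }
    }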
> > I would like to get this implemented for 2.9 as a building block
> > that perhaps we can write other things on.
>
> I think that's optimistic.  It's still at the
> hairy-can't-see-a-clean-way-to-do-it phase.  Plus I'd like to
> understand that all other options have been exhausted first.
> Especially once we have column stride fields and they are merged in
> RAM, you'll be handed a pre-warmed reader and you can then jump
> through those arrays to find docs to delete.
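For instance, a sketch of that array scan, with today's FieldCache
standing in for column stride fields (which don't exist yet; the "uid"
field name is an assumption):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    // Jump through a pre-warmed per-doc int array to find docs to delete.
    void deleteByUidScan(IndexReader reader,
                         java.util.Set<Integer> uidsToDelete)
        throws IOException {
      int[] uids = FieldCache.DEFAULT.getInts(reader, "uid");
      for (int docID = 0; docID < uids.length; docID++) {
        if (uidsToDelete.contains(Integer.valueOf(uids[docID]))) {
          reader.deleteDocument(docID);
        }
      }
    }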
> > Column stride fields still require some encoding, and merging field
> > caches in RAM would presumably be faster?
>
> Yes, potentially much faster.  There's no sense in writing through to
> disk until commit is called.
> >> Ie we only have to renumber from gen X to X+1, then from X+1 to
> >> X+2 (where each "generation" is a renumbering event).
> >
> > Couldn't each SegmentReader keep a docMap and the names of the
> > segments it originated from?  However, the name is not enough of a
> > unique key, as the deleted docs can change.  It seems like we need
> > a unique id for each segment reader, where the id is assigned to
> > cloned readers (which normally have the same segment name as the
> > original SR).  The ID could be a stamp (perhaps only given to
> > read-only readers?).  That way the SegmentReader.getMergedFrom
> > method would not need to return the actual readers, but a docMap
> > and the parent readers' IDs?  It would be assumed the user is
> > holding the readers somewhere?  Perhaps all this can be achieved
> > with a callback in IW, and all this logic could be kept somewhat
> > internal to Lucene?
> The docMap is a costly way to store it, since it consumes 32 bits per
> doc (vs storing a copy of the deleted docs).  But then the docMap
> gives you random access on the map.
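(For scale: at 32 bits per doc, the docMap for a 10M-doc segment is
~40 MB, while a 1-bit-per-doc copy of its deletes is ~1.25 MB.)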
> What if, prior to merging or committing merged deletes, there were a
> callback to force the app to materialize any privately buffered
> deletes?  And then the app is not allowed to use those readers for
> further deletes?  Still kinda messy.
>
> I think I need to understand better why delete-by-Query isn't viable
> in your situation...
>
> Mike
