lucene-dev mailing list archives

From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: Future projects
Date Thu, 02 Apr 2009 21:56:10 GMT
> I think I need to understand better why delete by Query isn't
> viable in your situation...

The delete by query is a separate problem which I haven't fully
explored yet. Tracking the segment genealogy is really an
interim step for merging field caches before column stride
fields (CSF) get implemented. Actually CSF cannot be used with
Bobo's field caches anyway, which means we'd still need a way to
find out about the segment parents.
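
To make the field cache merging concrete, here is a minimal sketch of
what I mean. It assumes a merge appends the parent segments in order
and drops their deleted docs, so the merged cache is just the surviving
values concatenated. FieldCacheMerger is a hypothetical name, not a
Lucene or Bobo class, and a plain int[] stands in for a real field
cache.

```java
import java.util.BitSet;

/** Hypothetical helper (not a Lucene or Bobo class): merges int field caches in RAM. */
final class FieldCacheMerger {

    /**
     * Concatenates the parent caches in merge order and skips deleted docs.
     * Assumption: the merged segment numbers its docs the same way, i.e.
     * parents in order with deleted docs squeezed out.
     */
    static int[] merge(int[][] parentCaches, BitSet[] parentDeletes) {
        // First pass: count the live docs so the merged array is sized exactly.
        int live = 0;
        for (int i = 0; i < parentCaches.length; i++) {
            for (int doc = 0; doc < parentCaches[i].length; doc++) {
                if (!parentDeletes[i].get(doc)) {
                    live++;
                }
            }
        }
        // Second pass: copy each surviving value to its new doc ID.
        int[] merged = new int[live];
        int newDoc = 0;
        for (int i = 0; i < parentCaches.length; i++) {
            for (int oldDoc = 0; oldDoc < parentCaches[i].length; oldDoc++) {
                if (!parentDeletes[i].get(oldDoc)) {
                    merged[newDoc++] = parentCaches[i][oldDoc];
                }
            }
        }
        return merged;
    }
}
```

This only works if we know which readers, with which deletes, went into
the merge, which is why the parents need to be exposed.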

> Does it operate at the segment level? Seems like that'd give
> you good enough realtime performance (though merging in RAM will
> definitely be faster).

We need to see how Bobo integrates with LUCENE-1483.

It seems like we've been talking about CSF for 2 years and there
isn't a patch for it? If I had more time I'd take a look. What
is the status of it?

I'll write a patch that implements a callback for the segment
merging such that the user can decide what information they want
to record about the merged SRs (I'm pretty sure there isn't a
way to do this with MergePolicy?)
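
Roughly, the callback could look like the sketch below. None of these
names exist in Lucene; SegmentMergeListener and GenealogyRecorder are
made up for illustration, and segments are identified by name only,
which glosses over the cloned-reader problem discussed further down.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical callback interface; nothing like this exists in Lucene today. */
interface SegmentMergeListener {
    /** Called once per merge with the parent segment names and the new segment name. */
    void onMerge(List<String> parentSegmentNames, String mergedSegmentName);
}

/** Example listener: records which segments each merged segment came from. */
final class GenealogyRecorder implements SegmentMergeListener {
    private final Map<String, List<String>> parents = new HashMap<String, List<String>>();

    @Override
    public void onMerge(List<String> parentSegmentNames, String mergedSegmentName) {
        // Copy the list so later changes by the caller do not affect the record.
        parents.put(mergedSegmentName, new ArrayList<String>(parentSegmentNames));
    }

    /** Returns the recorded parents, or an empty list for an unknown segment. */
    List<String> parentsOf(String segmentName) {
        List<String> result = parents.get(segmentName);
        return result == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(result);
    }
}
```

The user decides what to record in onMerge; the recorder above is just
one example of a listener.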


On Thu, Apr 2, 2009 at 2:41 PM, Michael McCandless <lucene@mikemccandless.com> wrote:

> On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen
> <jason.rutherglen@gmail.com> wrote:
> >> What does Bobo use the cached bitsets for?
> >
> > Bobo is a faceting engine that uses custom field caches and sometimes
> > cached bitsets rather than relying exclusively on bitsets to calculate
> > facets. It is useful where many facets (50+) need to be calculated and
> > bitset caching, loading and intersection would be too costly.  Instead it
> > iterates over in memory custom field caches while hit collecting.  Because
> > we're also doing realtime search, making the loading more efficient via
> > the in memory field cache merging is interesting.
>
> OK.
>
> Does it operate at the segment level?  Seems like that'd give you good
> enough realtime performance (though merging in RAM will definitely be
> faster).
>
> > True, we do the in memory merging with deleted docs, norms would be good
> > as well.
>
> Yes, and eventually column stride fields.
>
> > As a first step how should we expose the segments a segment has
> > originated from?
>
> I'm not sure; it's quite messy.  Each segment must track what other
> segment it got merged to, and must hold a copy of its deletes as of
> the time it was merged.  And each segment must know what other
> segments it got merged with.
>
> Is this really a serious problem in your realtime search?  Eg, from
> John's numbers in using payloads to read in the docID -> UID mapping,
> it seems like you could make a Query that when given a reader would go
> and do the "Approach 2" to perform the deletes (if indeed you are
> needing to delete thousands of docs with each update).  What sort of
> docs/sec rates are you needing to handle?
>
> > I would like to get this implemented for 2.9 as a building
> > block that perhaps we can write other things on.
>
> I think that's optimistic.  It's still at the
> hairy-can't-see-a-clean-way-to-do-it phase.  Plus I'd like to
> understand that all other options have been exhausted first.
>
> Especially once we have column stride fields and they are merged in
> RAM, you'll be handed a reader pre-warmed and you can then jump
> through those arrays to find docs to delete.
>
> > Column stride fields still require some encoding and merging field
> > caches in ram would presumably be faster?
>
> Yes, potentially much faster.  There's no sense in writing through to
> disk until commit is called.
>
> >> Ie we only have to renumber from gen X to X+1, then from X+1 to X+2
> >> (where each "generation" is a renumbering event).
> >
> > Couldn't each SegmentReader keep a docmap and the names of the segments
> > it originated from.  However the name is not enough of a unique key as
> > there's the deleted docs that change?  It seems like we need a unique id
> > for each segment reader, where the id is assigned to cloned readers
> > (which normally have the same segment name as the original SR).  The ID
> > could be a stamp (perhaps only given to readonlyreaders?).  That way the
> > SegmentReader.getMergedFrom method does not need to return the actual
> > readers, but a docmap and the parent readers IDs?  It would be assumed
> > the user would be holding the readers somewhere?  Perhaps all this can be
> > achieved with a callback in IW, and all this logic could be kept somewhat
> > internal to Lucene?
>
> The docMap is a costly way to store it, since it consumes 32 bits per
> doc (vs storing a copy of the deleted docs).
>
> But, then docMap gives you random-access on the map.
>
> What if prior to merging, or committing merged deletes, there were a
> callback to force the app to materialize any privately buffered
> deletes?  And then the app is not allowed to use those readers for
> further deletes?  Still kinda messy.
>
> I think I need to understand better why delete by Query isn't viable
> in your situation...
>
> Mike
