incubator-lucy-dev mailing list archives

From: Michael McCandless <>
Subject: Re: real time updates
Date: Tue, 17 Mar 2009 09:50:14 GMT
Marvin Humphrey <> wrote:

>> OK; I guess the salient question is how important is it, really,
>> for deletes to apply to docs added in the current session.  Do
>> users ever miss that?
> It's been noted.  I don't think anyone thinks the current KS
> behavior is better.


>> Right, OK.  Though Lucene takes a similar approach to KS, it's just
>> that we allow the intermediate segments (runs?) to be committed as
>> segments, and then normal merging merges them (i.e., the doc store
>> files are shared across segments for a single session).
> I think we can make that work for Lucy, too.  The KS external sorter
> now writes temporary files using the standard lexicon and postings
> formats.  It's flushing these to "runs", but there's not much of a
> difference between these runs and actual segments.

Right.  It seems redundant to maintain one code path for merging runs
and another for merging segments.  And it makes close() unexpectedly
costly when the runs must be force-merged to completion.

> Plus, the Snapshot class I'm about to propose for Lucy is
> specifically designed to accommodate extra-segment files like the
> doc stores.
> Cross-pollination FTW. :)


>> > Yeah.  I think if we cap the deletion percentage at 10% and merge
>> > any segment that exceeds that rate, plus apply deletions at the
>> > Scorer_Collect level, tombstones won't absolutely murder us.  But
>> > there's some work to do there.
>> Run some C perf tests first!
> For now I'll just put it off and use bit vector deletions.  I only need one
> writer and one consolidator at present.  Total throughput isn't a problem --
> worst-case performance is what's driving this feature.
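
For concreteness, bit-vector deletions amount to something like the
following sketch -- one bit per document, a set bit meaning "deleted".
All names here are made up for illustration, not KS's actual API:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint8_t  *bits;       /* one bit per document */
        uint32_t  capacity;   /* number of documents covered */
    } DelVector;

    DelVector*
    DelVec_new(uint32_t num_docs) {
        DelVector *self = malloc(sizeof(DelVector));
        self->capacity  = num_docs;
        self->bits      = calloc((num_docs + 7) / 8, 1);
        return self;
    }

    void
    DelVec_delete(DelVector *self, uint32_t doc_id) {
        self->bits[doc_id >> 3] |= (uint8_t)(1 << (doc_id & 7));
    }

    /* The per-hit test a Scorer_Collect loop would perform. */
    int
    DelVec_is_deleted(const DelVector *self, uint32_t doc_id) {
        return (self->bits[doc_id >> 3] >> (doc_id & 7)) & 1;
    }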

With Lucene, since we carry deletes in RAM and committing is expected
to be relatively rare, our "worst case" measure is the time it takes
to reopen a reader after making a bunch of adds/deletes to the index.

To reopen the reader we need to clone the norms/deletes, and that's
where our equivalent worst-case cost comes in.

But since we're in RAM we can consider other interesting transactional
data structures.
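
To make that worst case concrete, a hedged sketch of the clone step,
reusing the hypothetical DelVector from the sketch above -- the memcpy
over every segment's bit vector is where the reopen time goes:

    #include <string.h>

    /* Snapshot the in-RAM deletions so the reopened reader is
     * isolated from subsequent deletes made through the writer. */
    DelVector*
    DelVec_clone(const DelVector *self) {
        DelVector *twin = DelVec_new(self->capacity);
        memcpy(twin->bits, self->bits, (self->capacity + 7) / 8);
        return twin;
    }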

>> Interesting, Lucy will do GC of the search objects?  Is this used
>> for all objects, e.g. Document, Field, Term, Query, etc.?
> It's basically a straight-up refcounting model: each object keeps
> its own refcount, and we call Destroy() as soon as the refcount hits
> zero.
> All full-fledged objects are refcounted.
> There are some quirks, though, with how it manages host objects.
> The default behavior is to create a host language object at the same
> time as the Boilerplater object, and have the host object manage the
> refcount.
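
In plain C, that straight-up model boils down to something like this
sketch (illustrative only; Lucy's actual Obj layout is discussed
below):

    #include <stdint.h>

    typedef struct RcObj {
        uint32_t refcount;
        void   (*destroy)(struct RcObj *self);
    } RcObj;

    void
    RcObj_inc_refcount(RcObj *self) {
        self->refcount++;
    }

    /* Destroy() fires the moment the count hits zero. */
    void
    RcObj_dec_refcount(RcObj *self) {
        if (--self->refcount == 0) {
            self->destroy(self);
        }
    }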

Hmm, sounds tricky... because there are typically consumers both in C
and in the host language, and both need to incRef/decRef.

> However, there are a few classes -- CharBuf, VTable, and Posting
> among them -- that subclass "FastObj", which has its own refcount,
> and creates a host wrapper each time it crosses into host space.


> The second member in the Obj struct is a union:
>    typedef union boil_ref_t {
>        chy_u32_t count;      /* FastObj: refcount stored natively */
>        void *host_obj;       /* Obj: host-language wrapper object */
>    } boil_ref_t;
> FastObj keeps track of its refcount using self->ref.count; Obj delegates to
> self->ref.host_obj.  It's up to the binding to provide implementations for
> Obj_Inc_RefCount and Obj_Dec_RefCount.
> I haven't worked out the details on how the refcounted model is going to
> function within GC systems like Java and Ruby.  However, I think writing a
> full-featured tracing garbage collector that's supposed to work within a
> reference-counted system like Perl's would be harder.
> I've tried searching the web for resources on how to make
> refcounting and GC coexist happily, but I haven't found anything so
> far.  If anybody's got strong google-fu today or actually knows and
> can recommend some literature, I'm all ears.

This is tricky!
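
To sketch what those binding-provided implementations might look like
for the union above -- boil_Obj, Obj_Destroy(), and the Host_* hooks
are placeholders, and the signatures are guesses, not real
Boilerplater symbols:

    /* FastObj owns its count and destroys itself at zero. */
    void
    FastObj_Inc_RefCount(boil_Obj *self) {
        self->ref.count++;
    }

    void
    FastObj_Dec_RefCount(boil_Obj *self) {
        if (--self->ref.count == 0) { Obj_Destroy(self); }
    }

    /* Obj delegates to the host wrapper, e.g. SvREFCNT_inc /
     * SvREFCNT_dec on a Perl SV; the host calls Destroy() when
     * the wrapper's refcount hits zero. */
    void
    Obj_Inc_RefCount(boil_Obj *self) {
        Host_inc_refcount(self->ref.host_obj);
    }

    void
    Obj_Dec_RefCount(boil_Obj *self) {
        Host_dec_refcount(self->ref.host_obj);
    }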

>> I've been wondering how we can take advantage of concurrency within
>> a single search in Lucene.
> Well, doesn't it at least make logical sense to dedicate one thread
> per segment during search?  Then you just have to adopt a different
> definition for "optimize": instead of an "optimized" index always
> meaning "one segment", it means "one segment per thread".

I'm not sure I like tying single-search runtime concurrency directly
to the index structure (this was broached on a Jira issue somewhere).
I think I'd prefer to have each thread skipTo its starting docID and
process its chunk, instead of wiring one-thread-per-segment, but I'm
really not sure yet.
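
Roughly, the shape of that skipTo alternative -- Scorer and Collector
here are placeholder types, not the real Lucene or Lucy APIs, and each
thread's collector would be merged once all chunks finish:

    #include <stdint.h>

    typedef struct Scorer Scorer;
    typedef struct Collector Collector;

    /* Placeholder declarations for the sketch. */
    uint32_t Scorer_skip_to(Scorer *self, uint32_t target);
    uint32_t Scorer_next(Scorer *self);   /* UINT32_MAX when exhausted */
    float    Scorer_score(Scorer *self);
    void     Collector_collect(Collector *self, uint32_t doc, float score);

    typedef struct {
        Scorer    *scorer;      /* per-thread clone of the scorer */
        Collector *collector;   /* per-thread collector */
        uint32_t   start_doc;   /* first docID this thread owns */
        uint32_t   end_doc;     /* one past the last docID it owns */
    } SearchChunk;

    /* Thread body: skipTo the chunk's start, then collect hits
     * until crossing into the next thread's territory. */
    void*
    search_chunk(void *arg) {
        SearchChunk *chunk = (SearchChunk*)arg;
        uint32_t doc = Scorer_skip_to(chunk->scorer, chunk->start_doc);
        while (doc < chunk->end_doc) {
            Collector_collect(chunk->collector, doc,
                              Scorer_score(chunk->scorer));
            doc = Scorer_next(chunk->scorer);
        }
        return NULL;
    }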

