incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: real time updates
Date Mon, 16 Mar 2009 23:02:05 GMT
Marvin Humphrey <> wrote:

>> Remind me again how KS/Lucy does the deletes?  It must buffer if
>> you've buffered add docs?  Or is it not possible to delete a
>> recently added doc?
> The present implementation in KS proceeds commit by commit.  You
> can't delete a doc that hasn't been committed yet.

OK; I guess the salient question is how important is it, really, for
deletes to apply to docs added in the current session.  Do users ever
miss that?

I think with realtime search (at least how Lucene is planning to do
it), it actually is important because one could keep a writer session
open arbitrarily long, reopening readers as you go.

> A common technique when you are either adding or updating but don't
> know which is to always delete based on a primary key field, then
> add:
>  while ( my $doc = give_me_another_doc() ) {
>    $indexer->delete_by_term( field => 'key', term => $doc->{key} );
>    $indexer->add_doc($doc);
>  }
> If you use an app with that loop to add a very large number of docs,
> the buffered deletes will take up a lot of space in RAM.

Lucene also flushes based on number of buffered deletes.

> I think in Lucene, you flush more often to a new segment.  In KS
> that only happens once per indexing session.

Right, OK.  Though Lucene takes a similar approach to KS, it's just
that we allow the intermediate segments (runs?) to be committed as
segments, and then normal merging merges them (ie, the doc store files
are shared across segments for a single session).

>> > Claiming a new segment directory and committing a new master file
>> > (segments_XXX in Lucene, snapshot_XXX.json in KS) wouldn't require
>> > synchronization: if those ops fail because your process lost out in
>> > the race condition, you just retry.  The only time we have true
>> > synchronization requirements is during merging.
>> Wouldn't you need to synchronize here?  If two writers try to suddenly
>> write to the same snapshot_XXX.json, is that really safe (ie, one
>> always wins, and the other always loses and can reliably detect that
>> it lost)?
> You're right; I hadn't thought things through well enough.  There's
> no C API for a rename() op with the semantics of POSIX open() with
> O_EXCL.  No big deal, though -- we can use a commit.lock file for
> synchronizing those two ops.


>> I think this (reader ORs the multiple deleted docs) would be
>> workable; though it makes realtime search challenging.
> Yeah.  I think if we cap the deletion percentage at 10% and merge
> any segment that exceeds that rate, plus apply deletions at the
> Scorer_Collect level, tombstones won't absolutely murder us.  But
> there's some work to do there.

Run some C perf tests first!

>> > > Ugh, lock starvation.  Really the OS should provide a FIFO lock
>> > > queue of some sort.
> Heh, I think I just figured out how to solve the lock starvation
> issue.  :)
> When the consolidator completes, it will attempt once to acquire
> write.lock.  If it fails, it will leave behind the consolidate.lock
> file, plus write a "consolidate.done" file containing the name of
> the newly added segment and maybe some other info, then exit.
> Whenever one of the write processes finishes, it will look for
> "consolidate.done".  If it's found, the write process will pick up
> where the consolidator left off, finishing all merge-related
> operations and then releasing consolidate.lock.

Ahh, nice.  Drop a continuation into the filesystem :)

>> How, in general, are you approaching threads with Lucy?  It seems a
>> shame to forego threads entirely.
> We've never formally addressed the issue of threads.  The approach
> I've taken so far with KS is to try to keep my options open.  I've
> tried to keep most classes thread-safe when it's not too much work.
> I think the overall architecture should still be compatible with
> threads.  The only gaping holes are that KS lacks a lock-free hash
> for storing VTables and perhaps interned strings, and that the
> reference-counting GC model might be a synchronization bottleneck.

Interesting, Lucy will do GC of the search objects?  Is this used for
all objects, eg Document, Field, Term, Query, etc.?

Watch out for cycles with reference counting...

> However, I'm of two minds about this.  Threads are a PITA, and
> thanks to mmap and the OS cache, we might be able to use OS
> processes as our exclusive concurrency model.  Launching any number
> of search processes won't be a problem, and it looks like we might
> be able to pull off multiple write processes as well.
> There's a theoretical search-time performance issue on the horizon
> relating to threads: it would be nice to have a large index divided
> up into several segments of equal size, allowing us to dedicate one
> thread to each segment and finish a single search faster.  But even
> that is theoretically achievable using multiple processes under a
> MultiSearcher.

Yes, being mmap driven should mean processes are a fine substitute for

Yes I've been wondering how we can take advantage of concurrency
within a single search in Lucene.  I think it's important, especially
as we all switch to SSDs.

> It would be nice, though, if we could make Lucy thread-safe as a
> library, so that we don't break other people's threading models.
> That way, you could, say, run several indexers and one consolidator
> within a single multi-threaded app -- realizing the benefit of
> threads even if no individual Lucy object knows how to exploit them.



View raw message