incubator-lucy-dev mailing list archives

From: Michael McCandless <>
Subject: Re: real time updates
Date: Mon, 16 Mar 2009 09:51:22 GMT
Marvin Humphrey wrote:

> > Right.  I guess it's because Lucene buffers up deletes that it can
> > continue to accept adds & deletes even during the blip.  But it
> > cannot write a new segment (materialize the adds & deletes) during
> > the blip.
> OK, I think that makes sense.  Lucene isn't so much performing
> deletions as promising to perform deletions at some point in the
> future.  There's still a window where no new deletions are being
> performed (the "blip"), and the process of reconciling deletions
> finishes during this window.

Right.  But added docs are in fact indexed during the blip (just no
segment can be written).
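
Roughly, the bookkeeping looks like this -- purely illustrative C
with made-up names, not Lucene's actual internals:

    #include <stdlib.h>
    #include <string.h>

    #define MAX_BUFFERED 1024

    typedef struct {
        /* Deletions promised but not yet applied. */
        char  *buffered_del_terms[MAX_BUFFERED];
        size_t num_buffered;
        /* ... plus the in-RAM postings for buffered added docs ... */
    } WriterState;

    static void delete_by_term(WriterState *w, const char *term) {
        /* Accepted even during the blip: we only record the term. */
        if (w->num_buffered < MAX_BUFFERED)
            w->buffered_del_terms[w->num_buffered++] = strdup(term);
    }

    static void flush_segment(WriterState *w) {
        /* Materialize buffered adds as a new segment, then resolve
         * each buffered term to per-segment deletion bit vectors.
         * This is the step that can't happen during the blip. */
        for (size_t i = 0; i < w->num_buffered; i++) {
            /* ... look up term, set bits in each segment's .del ... */
            free(w->buffered_del_terms[i]);
        }
        w->num_buffered = 0;
    }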

Remind me again how KS/Lucy does the deletes?  It must buffer them if
you've buffered added docs?  Or is it not possible to delete a
recently added doc?

> > (Though... that's tricky, with deletes; oh maybe because you store
> > new deletes for an old segment along with the new segment that's
> > OK?  Hmm, it still seems like you'd have a staleness problem).
> What if we have the deletions reader OR together all bit vectors
> against a given segment?  Search-time performance would dive of
> course, but I believe we'd get logically correct results.
> Under the Lucene bit-vector naming scheme, you'd need to keep every
> deletions file around for the life of a given segment -- at least
> until you had a consolidator process lock everything down and write
> an authoritative bit vector.  With the current KS bit-vector naming
> scheme, out of date bit-vector files would be zapped by the merging
> process (which in this case means the consolidator).  I don't think
> it's any more efficient, though it's arguably cleaner.

You mean if we were to implement such a multi-writer approach in
Lucene, we'd need to keep all the _X_N.del's around?  Actually, I
think we'd also zap them on completing the merge (& carrying over any
new deletes).  But I don't think Lucene will do multi-process writing
any time soon.  We have good concurrency (I think -- there has been at
least one user report to the contrary) w/ multiple threads in a
single process.
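
The OR itself is cheap, at least.  A sketch (hypothetical code,
assuming byte-aligned bit vectors of equal length for one segment):

    #include <stddef.h>
    #include <stdint.h>

    /* A doc is deleted if *any* writer's deletions file marked it.
     * "merged" starts zeroed; dels[f] is the f'th bit vector loaded
     * from disk for this segment. */
    static void or_deletions(uint8_t *merged, uint8_t **dels,
                             size_t num_files, size_t num_bytes) {
        for (size_t f = 0; f < num_files; f++)
            for (size_t i = 0; i < num_bytes; i++)
                merged[i] |= dels[f][i];
    }

A consolidator could later collapse these into one authoritative
vector, as you describe.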

> The tombstone approach would work for the same reason.  It doesn't
> matter if multiple tombstone rows contain a tombstone for the same
> document, because the priority queue ORs together the results.
> Therefore, you don't need to coordinate the addition of new
> tombstones.
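
For concreteness, the union-of-tombstones iteration might look
something like this (made-up names; a real implementation would
presumably use a heap rather than this linear scan over rows):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        const int32_t *ids;  /* sorted deleted doc ids in one row */
        size_t pos, len;
    } TombstoneRow;

    /* Return the smallest deleted doc id greater than "last" across
     * all rows, or -1 when exhausted.  Duplicate tombstones for the
     * same doc collapse naturally: anything <= last is skipped. */
    static int32_t next_deletion(TombstoneRow *rows, size_t n,
                                 int32_t last) {
        int32_t best = -1;
        for (size_t r = 0; r < n; r++) {
            TombstoneRow *t = &rows[r];
            while (t->pos < t->len && t->ids[t->pos] <= last)
                t->pos++;
            if (t->pos < t->len && (best < 0 || t->ids[t->pos] < best))
                best = t->ids[t->pos];
        }
        return best;
    }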


> Claiming a new segment directory and committing a new master file
> (segments_XXX in Lucene, snapshot_XXX.json in KS) wouldn't require
> synchronization: if those ops fail because your process lost out in
> the race condition, you just retry.  The only time we have true
> synchronization requirements is during merging.

Wouldn't you need to synchronize here?  If two writers try to write
to the same snapshot_XXX.json simultaneously, is that really safe
(ie, one always wins, and the other always loses and can reliably
detect that it lost)?
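
I assume you mean leaning on exclusive file creation, something like
this (sketch; error handling elided):

    #include <fcntl.h>
    #include <unistd.h>

    /* Try to claim snapshot_XXX.json.  Returns 0 on success; -1
     * means another writer won the race and the caller should retry
     * with the next generation number. */
    static int commit_snapshot(const char *path, const char *json,
                               size_t len) {
        int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0) return -1;  /* lost the race */
        ssize_t wrote = write(fd, json, len);
        close(fd);
        return wrote == (ssize_t)len ? 0 : -1;
    }

But O_EXCL has historically been unreliable over NFS, which is
exactly my worry.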

> So... if we were to somehow make tombstones perform adequately at
> search-time, I think we could make a many-writers-single-merger
> model work.

I think this (the reader ORing together the multiple deleted-docs
vectors) would be workable, though it makes realtime search
challenging.

> > Ugh, lock starvation.  Really the OS should provide a FIFO lock
> > queue of some sort.
> Well, I think this would be less of a headache if we didn't need
> portability.  It's just that the locking and IPC mechanisms provided
> by various operating systems out there are wildly incompatible.

Tell me about it... we've had that discussion before.

> Unfortunately, I don't think there's any other way to implement
> background merging for all Lucy target hosts besides the
> multiple-process approach.  Lucy will never work with Perl ithreads.

How, in general, are you approaching threads with Lucy?  It seems a
shame to forgo threads entirely.  What do Python, Perl, Ruby, etc.
have in common in their thread semantics?

EG Python is great about threads, in that they are native platform
threads.  Python itself has a giant lock (the GIL) that prevents
multiple threads from running in the interpreter at once; but once
threads enter Lucy code they would release this lock and gain
concurrency.
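
Concretely, a binding could wrap its C calls like this (sketch; the
lucy_search() call is a stand-in, not a real API):

    #include <Python.h>

    static PyObject *
    py_search(PyObject *self, PyObject *args) {
        /* ... parse args into plain C structures ... */
        Py_BEGIN_ALLOW_THREADS   /* release the interpreter lock */
        /* lucy_search(...);        pure C; touches no PyObjects */
        Py_END_ALLOW_THREADS     /* reacquire before building results */
        Py_RETURN_NONE;
    }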

> PS: FYI, your messages today have premature line-wrapping issues --
> your original text, not just the quotes.

GRRR.  Thanks for pointing it out.  Why can't everything just
work?  Is this email better?

