incubator-lucy-dev mailing list archives

From Michael McCandless <>
Subject Re: threads
Date Sun, 29 Mar 2009 13:21:58 GMT
On Sat, Mar 28, 2009 at 9:11 PM, Marvin Humphrey <> wrote:

>> Are more than one thread allowed inside the Lucy core at once?
> I would like to.  However, I think it's important to make three points up
> front.
>  1) Concurrency is hard.  Even in languages with comparatively good support
>     for threads like Java, threads programming is a bug-spawning developer
>     timesuck.

I could not agree more (especially coming off of LUCENE-1516!).

I think concurrency is so hard that either 1) we (the "royal we",
our species) are gonna have to punt on micro-concurrency (not use it,
i.e. use only macro/strongly-divorced concurrency, like multi-process,
one-thread-does-whole-search), or 2) move to better dynamic languages
that are largely declarative, such that the runtime has the freedom to
use micro-concurrency as it sees fit.

>  2) We will not be able to abstract multiple host threading models so that we
>     can make sophisticated use of them in the Lucy core.

But it seems like there is a common denominator here ("can more than
one thread be running inside Lucy's core"), regardless of whether
native/user threads, different thread packages, etc., are the actual
threads implementation.  Python does a good job building an abstract
API, which it consumes itself, on top of the many actual thread
implementations.
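To make the shape of such an abstraction concrete, here is a minimal sketch of the pattern: a tiny spawn/join API that the core would program against, with the backing implementation hidden behind it.  All names here (`CoreThread`) are illustrative, not Lucy API.

```python
# A minimal sketch of the abstraction pattern: the core consumes a
# small spawn/join API, and the backing implementation (native
# threads here, but it could be green threads, or nothing at all in
# a single-threaded build) is hidden behind it.
import threading

class CoreThread:
    """Thin wrapper the core would program against."""
    def __init__(self, fn, *args):
        self._impl = threading.Thread(target=fn, args=args)

    def start(self):
        self._impl.start()

    def join(self):
        self._impl.join()

results = []
t = CoreThread(results.append, 42)
t.start()
t.join()
print(results)  # [42]
```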

In any event, if Lucy will not allow more than one thread in the core
at once, it still must mate properly with the threads running in the
host.  E.g., you can't just up and call a Python API; you have to
register your thread with it (likewise for JNI), release/reacquire
Python's global lock when crossing the bridge, etc.  And if more than
one thread attempts to cross the bridge into Lucy, there must be a
troll to make the 2nd thread wait, etc.

Punting on threads still means you need to do something ;)
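The "troll at the bridge" can be sketched as a single lock guarding entry into a core that permits only one thread inside at a time.  `_lucy_core_search` below is a stand-in for a real core routine; the names are illustrative only.

```python
# Sketch of a "troll" guarding a single-threaded core: any number of
# host threads may call search(), but only one at a time crosses
# into the core; the rest wait at the lock.
import threading

_core_lock = threading.Lock()  # the troll at the bridge

def _lucy_core_search(query):
    # Placeholder for a core routine that is not thread-safe.
    return ["doc1", "doc2"] if query else []

def search(query):
    with _core_lock:
        return _lucy_core_search(query)

results = search("apache")
print(results)  # ['doc1', 'doc2']
```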

>  3) Multiple processes will always be available -- but threads won't.
> For those reasons, in my opinion we should keep our threading ambitions to a
> minimum.


> I think we should have two priorities:
>  1) Don't break the host's threading model.
>  2) Make it possible to exploit threads in a limited way during search.
>> Or are we "up-front" expecting one to always use separate processes to
>> gain concurrency?
> Fortunately, thanks to mmap, we are going to be able to make excellent use of
> multiple processes.  If we had no choice but to read index caches into process
> memory every time a la Java Lucene, we would have far more motivation to rely
> on threads within a single process as our primary concurrency model.
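The mmap point above is worth illustrating: when every process maps the same index file, the OS page cache is shared, so each process avoids the per-process cost of reading caches into its own memory.  A rough sketch of the mechanism (file contents and names are made up for the example):

```python
# Two processes share one "index" file via mmap: the parent writes
# it once, a child process maps it read-only and extracts a slice.
# The OS page cache backs both mappings, so nothing is copied per
# process beyond the pages actually touched.
import mmap
import os
import tempfile
from multiprocessing import Process

def reader(path, offset, length, out_path):
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        data = mm[offset:offset + length]
        mm.close()
    with open(out_path, "wb") as out:
        out.write(data)

# "Index" file written once by the parent process.
fd, path = tempfile.mkstemp()
os.write(fd, b"segment-0 postings ...")
os.close(fd)

out = path + ".out"
p = Process(target=reader, args=(path, 0, 9, out))
p.start()
p.join()
with open(out, "rb") as f:
    print(f.read())  # b'segment-0'
```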

You are then forcing the host to make use of multiple processes, too
(or... to access Lucy over some remote "bridge", e.g. spawn a
subprocess and talk over pipes).  I imagine Python & Perl are fairly
fast to start up.  Java historically was not, though that was a sore
point and may be better now (the whole client-vs-server VM thing).

> For indexing, I think we should make it possible to support concurrency using
> multiple indexer *objects*.  Whether those multiple indexer objects get used
> within a single multi-threaded app, or within separate processes shouldn't be
> important.  However, I think it's very important that we not *require* threads
> to exploit concurrency at index-time.

Interesting... so you could take the "strongly divorced" approach,
yet, allow multiple such "worlds" inside a single Lucy process?  This
seems like a nice middle ground.  And mmap is the basis for sharing.
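The multiple-indexer-objects idea can be sketched like this: each worker owns its own indexer writing its own segment, so exactly the same code works whether the workers are threads or processes.  The `Indexer` class here is a stand-in for illustration, not the real Lucy API.

```python
# Sketch of "multiple indexer objects" concurrency: one independent
# indexer per worker, no shared mutable state between indexers, so
# threads are an option but never a requirement.
import threading

class Indexer:
    def __init__(self, seg_name):
        self.seg_name = seg_name
        self.docs = []

    def add_doc(self, doc):
        self.docs.append(doc)       # touches only this indexer's state

    def commit(self):
        return (self.seg_name, len(self.docs))

segments = []
seg_lock = threading.Lock()         # only the final commit is shared

def index_batch(seg_name, batch):
    ix = Indexer(seg_name)          # one indexer per worker
    for doc in batch:
        ix.add_doc(doc)
    with seg_lock:
        segments.append(ix.commit())

workers = [threading.Thread(target=index_batch,
                            args=("seg_%d" % i, ["doc-a", "doc-b"]))
           for i in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(segments))  # [('seg_0', 2), ('seg_1', 2)]
```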

Python allows this too.  You can create multiple interpreter "worlds",
and each runs freely, independent of the others (separate global lock).

> For searching, I think we have no choice.  There are certain things which
> cannot be achieved using a process-based concurrency model because portable
> IPC techniques are too crude -- e.g. HitCollector-based scoring routines.

Yes, though "you really should not do that" (have your HitCollector on
the other side of a looooong (IPC) bridge).  Surely you could have a
host wrapper embedding Lucy so the HitCollector runs there locally,
and the long IPC bridge is crossed only to deliver the final results.
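The shape of that arrangement, sketched with made-up data: the child process embeds the engine and runs the collector next to the index, so per-hit calls are cheap, and only the small final top-N list crosses the pipe.

```python
# Sketch: the collector runs in the worker process, beside the
# index; only the final top hits cross the IPC bridge.
from multiprocessing import Pipe, Process

def search_worker(conn, query, docs):
    # Per-hit collection happens here -- no IPC per hit.
    hits = [(score, d) for score, d in docs if query in d]
    hits.sort(reverse=True)
    conn.send(hits[:2])     # only the top hits cross the bridge
    conn.close()

docs = [(0.9, "apache lucy"), (0.5, "apache ant"), (0.1, "cmake")]
parent, child = Pipe()
p = Process(target=search_worker, args=(child, "apache", docs))
p.start()
top = parent.recv()
p.join()
print(top)  # [(0.9, 'apache lucy'), (0.5, 'apache ant')]
```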

>> Whichever it is, Lucy will need to do something when crossing the
>> bridge to "mate" to the Host language's thread model.
> I think what we're going to have to do is issue a callback to the Host
> whenever multiple threads might be launched, and wait for that call to return
> after all threads have concluded their work.
> In a multi-threaded Host, several threads might run in parallel.  In a
> single-threaded Host, the threaded calls will run sequentially.


Say lots of searches need to run concurrently (call this "macro"
concurrency, vs the "micro" concurrency below where more than one
thread works on a single search): will Lucy allow a host to send in N
threads doing their own searching, and allow those threads to run
concurrently (if the C platform supports real threads)?
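The "macro" case above has a simple shape: N host threads, each running a whole search against a shared read-only reader, with no coordination between them.  Nothing here is Lucy API; the reader is a plain dict standing in for an index.

```python
# Sketch of macro concurrency: each thread runs one complete,
# independent search over shared read-only data.
from concurrent.futures import ThreadPoolExecutor

READER = {"apache": ["doc1", "doc3"], "lucy": ["doc2"]}  # read-only

def whole_search(term):
    # One thread does the whole search; no shared mutable state.
    return (term, READER.get(term, []))

with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(whole_search, ["apache", "lucy", "ant"]))
print(results["ant"])  # []
```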

>> At some point, as described above, a single search will need to use
>> concurrency; it seems like Lucy should allow multiple threads into the
>> core for this reason.
> I think we have no choice but to allow threads during search in order to
> exploit multiple processors and return the answer to a query as fast as
> possible.
> Mike, I know you would prefer not to tie the index format to our concurrency
> model, but I think a one-thread-per-segment scoring model makes a lot of
> sense.  Using skipping info could work with core classes, but there's tension
> between that and making it easy to write plugins:
> It's easy to tell a plugin "skip to the next segment".  (In fact, I think we
> might consider making all Scorers single-segment only.)  It's hard to require
> that all Scorer and DataReader subclasses implement intra-segment skipping
> support.
> In order to support multi-threaded search for custom index components, I think
> we should adopt a segment-based model and adjust our index optimization
> APIs and algorithms to fit that model.

OK I think that's a viable approach too.
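A rough sketch of the one-thread-per-segment model Marvin describes: each worker scores exactly one segment (so a Scorer never needs intra-segment skipping across threads), and the per-segment results are merged at the end.  Segment contents and scores are made up for illustration.

```python
# Sketch of one-thread-per-segment scoring with a final merge:
# each worker sees a single segment, then per-segment top hits are
# combined into a global top-2.
import heapq
from concurrent.futures import ThreadPoolExecutor

SEGMENTS = [
    [("doc1", 0.9), ("doc2", 0.3)],   # segment 0: (doc, score)
    [("doc3", 0.7)],                  # segment 1
]

def score_segment(seg):
    # A Scorer that only ever sees one segment.
    return sorted(seg, key=lambda h: h[1], reverse=True)

with ThreadPoolExecutor() as pool:
    per_seg = list(pool.map(score_segment, SEGMENTS))

# Merge per-segment results into a global top-2.
top = heapq.nlargest(2, (h for seg in per_seg for h in seg),
                     key=lambda h: h[1])
print(top)  # [('doc1', 0.9), ('doc3', 0.7)]
```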

> I should also note that my personal priority regarding threads has been and
> remains to avoid foreclosing on the option of using them.  However, I'm
> working in a single-threaded environment right now, and I don't have the means
> to test my code for thread-safety.

Not precluding future threads seems like a good overall goal.  Since
you use the host's ref counting, you inherit thread safety for that,
which is nice.

> The first module we'd have to work on to make Lucy safe for threads would be
> Lucy::Util::Hash, which is used to associate class names with VTable
> instances.  However, I'm not going to delay submitting that module for the
> sake of making it thread-safe first.
