incubator-lucy-dev mailing list archives

From Michael McCandless
Subject Re: threads
Date Tue, 31 Mar 2009 09:32:54 GMT
On Mon, Mar 30, 2009 at 11:55 PM, Marvin Humphrey wrote:
> On Mon, Mar 30, 2009 at 06:08:28PM -0400, Michael McCandless wrote:
>> >  * The VTable_registry Hash which is used to associate VTables with class
>> >    names is stateful.
>> What is this hash used for?  Does it map name -> VTable?
> Yes.  It is used to support dynamic subclassing, and also to support
> deserialization (when the appropriate deserialization function must be selected
> based on the class name).

The deserialization need makes sense, though I guess you can't
serialize/deserialize dynamic vtables?  (Since, as you showed, MyScorer
can be different at different times... maybe you can consider making
the VTable_registry write-once.)
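
To make the write-once idea concrete, here is a rough sketch in Python.
It is not Lucy code: WriteOnceRegistry, register() and fetch() are
made-up names, and plain objects stand in for vtables.  The point is
only that the first binding for a name wins and is never replaced.

```python
import threading


class WriteOnceRegistry:
    """Hypothetical name -> vtable map; a name can be bound only once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_name = {}

    def register(self, name, vtable):
        # Return whichever vtable ends up bound to `name`.  The first
        # registration wins; later attempts never replace it.
        with self._lock:
            return self._by_name.setdefault(name, vtable)

    def fetch(self, name):
        # Deserialization would call this to pick a class by its name.
        with self._lock:
            return self._by_name.get(name)


registry = WriteOnceRegistry()
first = object()   # stand-in for the "MyScorer" vtable
second = object()  # a later, different vtable using the same name

assert registry.register("MyScorer", first) is first
assert registry.register("MyScorer", second) is first  # not replaced
assert registry.fetch("MyScorer") is first
```

With that property, "MyScorer" means the same vtable for the life of
the process, so a serialized class name always resolves the same way.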

Dynamic subclassing, I think, just comes down to how you gain a
reference to the vtable you want to subclass, which you could do by
lookup by name, but I don't yet see why you must do it that way.

>> It seems like most references within core could be done by the
>> compiler/linker?
> I think you could say that most references within core rely on referring to
> the VTable structs generated at compile-time by Boilerplater.
>                     vvvv
>  if (!OBJ_IS_A(obj, HASH)) {
>      THROW("That's not a Hash, it's a %o", Obj_Get_Class_Name(obj));
>  }

Right, so if you make a dynamic vtable inside the Lucy core, it'd
presumably have a reference to the vtable via the compiler.  EG maybe
I have a "MakeCustomScorer" method, and inside there it's got a
reference to the global Scorer (statically created) vtable that the C
compiler/linker mapped.
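
A small Python sketch of that shape, with invented names (VTable,
SCORER, make_custom_scorer): the factory holds a direct reference to
the parent vtable, the way C code would reference a linked symbol, and
never consults a name registry.

```python
class VTable:
    """Hypothetical vtable: just a name and a parent pointer."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


# Stand-in for the statically created, global Scorer vtable.
SCORER = VTable("Scorer")


def make_custom_scorer(name):
    # The parent is reached by direct reference, not by lookup by name.
    return VTable(name, parent=SCORER)


custom = make_custom_scorer("MyScorer")
assert custom.parent is SCORER
assert custom.name == "MyScorer"
```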

>>  And then when host needs to refer to an object inside the core,
>> shouldn't the bindings be exposed however the host would do it (eg as
>> a Python module(s) with bindings), also compiled/linked statically.
> I don't think I follow that...  Are you saying that dynamic subclassing would
> never be needed under Python?

No, it would be needed.

I'm saying the Python module wrapping Lucy would expose such bindings.  EG:

  import lucy
  print('The scorer vtable is %s' % lucy.Scorer)

(or perhaps lucy has sub-modules).

And the C code for that Python lucy module would have referenced the C
static global Scorer (just like the above example).  Dynamic lookup by
name isn't necessary because you'd be relying on Python's module/dict
capabilities to do so.
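
Here is a self-contained illustration of leaning on Python's module
dict.  No real lucy module is assumed: a types.ModuleType object stands
in for the C extension, and the Scorer class is a placeholder for the
wrapper around the static vtable.

```python
import types

# Stand-in for the C extension module.  In a real binding the attribute
# would be set by the module's C init code from the static Scorer vtable.
lucy = types.ModuleType("lucy")


class Scorer:
    """Placeholder for the wrapper around the static Scorer vtable."""


lucy.Scorer = Scorer

# Host code reaches the class through ordinary attribute access...
print("The scorer vtable is %s" % lucy.Scorer)

# ...and lookup by name, when needed, is just Python's own module dict.
assert getattr(lucy, "Scorer") is lucy.Scorer
assert vars(lucy)["Scorer"] is Scorer
```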

>> > The VTables themselves are stateful because they are refcounted.
>> > Furthermore, dynamically created VTables are reclaimed once the last object
>> > which needs them goes away -- so the "MyScorer" VTable at one point in the
>> > program might not be the same as the "MyScorer" VTable at another point in
>> > the program.
>> This is neat: so you can make a new subclass at runtime, instantiate a
>> bunch of objects off of it, and once all objects are gone, and nobody
>> directly refers to the subclass (vtable), it gets reclaimed.  Can you
>> define new vtables from the host language?
> Yes, transparently, in Perl at least.  Standard Perl subclassing techniques
> work.
>   package MyObj;
>   use base qw( Lucy::Obj );
>   package main;
>   my $obj = MyObj->new; # MyObj vtable created via inherited constructor.
>   undef $obj;           # Triggers DESTROY. Last ref to dynamic VTable for
>                         # MyObj disappears, so it gets reclaimed.
> At least that's how things work now.  To make things thread-safe, we'll create
> the VTable for "MyObj" dynamically, but once created it will never go away.

Once created it never goes away, because we've decided it's fine to
leak them?  (Which I agree with).

Can a vtable gain new methods dynamically?

>> > However, if we stop refcounting VTables (by making Inc_RefCount and
>> > Dec_RefCount no-ops) and accept that dynamically created VTables will leak,
>> > then those concerns go away.  I think this would be a reasonable approach.
>> > The leak ain't gonna matter unless somebody does something psycho like cycle
>> > through lots of unique class names.
>> That seems tentatively OK.  What do you see dynamic vtables being used
>> for in Lucy?
> They are required for subclassing in Perl, at least.

Ahh, I see.  When one subclasses in Perl, it will make a corresponding
dynamic vtable in Lucy?  And then when one instantiates a bunch of
objects from such a class in Perl, these have corresponding bunches of
objects (not vtables) in Lucy's core?
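
Assuming that reading is right, the shape would be one vtable per
class and many objects pointing at it.  A sketch in Python, where
VTable and Obj are illustrative stand-ins rather than Lucy's types:

```python
class VTable:
    """Hypothetical vtable: just a name and a parent pointer."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


class Obj:
    vtable = VTable("Obj")  # stand-in for a static, compile-time vtable

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once per subclass.  At this point cls.vtable still
        # resolves to the parent's vtable, so use it as the parent.
        cls.vtable = VTable(cls.__name__, parent=cls.vtable)


class MyObj(Obj):
    pass


objs = [MyObj() for _ in range(3)]

# Three objects, but only one dynamic vtable behind them.
assert all(o.vtable is MyObj.vtable for o in objs)
assert MyObj.vtable.parent is Obj.vtable
assert MyObj.vtable is not Obj.vtable
```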

>> So then we don't have to worry about individual objects being thread
>> safe as long as we can ensure threads never share objects.
> Yes. :)  [Pending resolution of issues around remaining shared globals.]

I like this approach but I fear it may be fragile, i.e. hitting a SEGV
instead of a RuntimeException when something small gets messed up.  Not
gracefully degrading...

>> It might get weird if the host also doesn't maintain separate worlds,
>> eg the host can think it created object X in Lucy but then if it
>> checks back later and gets the wrong world, object X is gone.
> I don't understand how this could happen, so perhaps I'm not following.

Say from Python you start up Lucy and make 4 worlds (since that's how
much concurrency you want to use) for searching.  Is there a single
Python interpreter (since Python has no trouble letting many threads
in)?  If so, from Python I must be aware of those separate Lucy
worlds... eg if I want to make a new dynamic class, I'd need to notify
all 4 worlds to create the corresponding vtable?
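
Roughly what that bookkeeping would look like from the host side.
World and define_class() are hypothetical names for illustration only;
nothing here is an actual Lucy API.

```python
class World:
    """Hypothetical isolated Lucy world with its own vtable registry."""

    def __init__(self, ident):
        self.ident = ident
        self.vtables = {}

    def define_class(self, name, parent="Obj"):
        self.vtables[name] = (name, parent)


worlds = [World(i) for i in range(4)]

# Defining the class in only one world leaves the others unaware of it.
worlds[0].define_class("MyScorer", parent="Scorer")
missing = [w.ident for w in worlds if "MyScorer" not in w.vtables]
assert missing == [1, 2, 3]

# The host has to broadcast the definition to every world itself.
for w in worlds:
    w.define_class("MyScorer", parent="Scorer")
assert all("MyScorer" in w.vtables for w in worlds)
```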

I think the multi-world approach is going to be problematic.

(Vs the single-world-but-threads-rarely-share-objects, which seems
best except for possible fragility).

>> > Then there's search-time, where the multi-world approach isn't always
>> > adequate.  Here, things get tricky because we have to worry about the
>> > statefulness of objects that could be operating in multiple search threads.
>> >
>> > I think we'll be generally OK if we do most of our scoring at the segment
>> > level.  Individual SegReaders don't share much within a larger PolyReader,
>> > Scorers derived from those SegReaders won't either.
>> OK even for concurrency within a single search, it sounds like?
> All of the core Scorers can be made thread-safe for single-search concurrency.
> We just need to avoid using stateful objects which are shared by multiple
> SegReaders.  The list of those is reasonably small.
> Individual DataReaders (PostingsReader, LexReader, DocReader, DeletionsReader,
> etc) might or might not be stateful, but not in ways that are shared (other
> than refcounting).
> For instance, consider a PostingsReader.  It holds open at least one InStream
> which is stateful on 32-bit systems because of its sliding window.  It would
> not be safe to share that InStream across multiple threads -- but there's no
> reason that InStream would ever be shared across threads if we are scoring
> per-segment, and for that matter, actually cloning the InStream whenever we
> spawn a PostingList object.
> That PostingsReader also holds a reference to a shared Schema instance, which
> may have Analyzers that aren't thread safe.  For that matter, Schema itself
> technically isn't thread safe, because you might call Schema_Spec_Field() at
> any moment.  However, just because you *can* use those shared elements in a
> non-thread-safe way doesn't mean there's a reason you would do so.  No core
> Scorer would, at least.

Right. This would be a "cooperative" approach to threading.  I
carefully don't share things with you, so we won't crash and burn.
Sort of like the each-kid-gets-their-own-cup-so-germs-don't-spread
rule that many parents adopt.
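
A toy version of the own-cup rule, in Python.  The InStream class here
is a made-up stand-in with a read position as its only state, and
score_segment() is an invented helper; the sketch assumes one worker
per segment, each cloning the stream before reading.

```python
from concurrent.futures import ThreadPoolExecutor


class InStream:
    """Hypothetical stateful stream: reading moves an internal position."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def clone(self):
        # A clone shares the read-only data but has its own position.
        return InStream(self._data)

    def read_all(self):
        out = []
        while self._pos < len(self._data):
            out.append(self._data[self._pos])
            self._pos += 1
        return out


def score_segment(stream):
    # Each worker clones before reading, so no position is ever shared.
    private = stream.clone()
    return sum(private.read_all())


segments = [InStream([1, 2, 3]), InStream([10, 20]), InStream([100])]
with ThreadPoolExecutor(max_workers=3) as pool:
    totals = list(pool.map(score_segment, segments))

assert totals == [6, 30, 100]
```

Nothing enforces the rule, though: a worker that read from the
original stream instead of its clone would still run, just not safely.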

>> > Then there's refcounting.  For most objects operating in a per-segment scoring
>> > thread, it wouldn't be necessary to synchronize on Inc_RefCount and
>> > Dec_RefCount, but I don't see how we'd be able to pick and choose our sync
>> > points.
>> It sounds like we either 1) try to guarantee threads won't share
>> objects, in which case you don't need to lock in the object nor in
>> incref/decref,
> ... good enough for indexing...
>> or 2) allow for the possibility of sharing.
> ... required for searching.
>> You could use atomic increment/decrement, though I'm unsure how costly
>> those really are at the CPU level.
> For C under pthreads, where the refcount will presumably be a plain old
> integer, that's probably what we want.

Be careful: I'm unsure how bad these instructions are at the CPU
level.  They feel wonderful in C, but I think they may use the LOCK
prefix (on Intel CPUs) and I think I've read bad things about
what that prefix requires of all other running CPUs.  I'm not
certain, though, and that was a while ago.

I think a far better approach, if we can pull it off, is to not do any
locking because we are certain that some objects are never shared.
Perhaps there'd be "thread safe" incref/decref, and "private"
incref/decref.  But fragility is there...
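
A sketch of the two flavors, in Python for brevity.  RefCounted,
inc_private() and inc_shared() are invented names, and a lock stands in
for whatever atomic increment the C code would use, so this shows the
split in the API and says nothing about CPU-level cost.

```python
import threading


class RefCounted:
    """Hypothetical object offering two flavors of refcounting."""

    def __init__(self):
        self.refcount = 1
        self._lock = threading.Lock()

    def inc_private(self):
        # No synchronization: only safe if one thread owns the object.
        self.refcount += 1

    def inc_shared(self):
        # Synchronized: safe when several threads hold references.
        with self._lock:
            self.refcount += 1


shared = RefCounted()


def worker():
    for _ in range(10000):
        shared.inc_shared()


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert shared.refcount == 1 + 4 * 10000
```

The fragility is visible here too: nothing stops a caller from using
inc_private() on an object that has quietly become shared.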
