Mailing-List: contact lucy-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: lucy-dev@lucene.apache.org
Received-SPF: pass (nike.apache.org: local policy)
Date: Mon, 30 Mar 2009 20:55:36 -0700
To: lucy-dev@lucene.apache.org
Subject: Re: threads
Message-ID: <20090331035536.GA12844@rectangular.com>
References: <9ac0c6aa0903270528k689ff1c5n1ddc557eb7a171fa@mail.gmail.com>
 <20090329011117.GA28601@rectangular.com>
 <9ac0c6aa0903290621o3f904e8ct276a45acb54dfd57@mail.gmail.com>
 <20090329232903.GA31241@rectangular.com>
 <9ac0c6aa0903301508x4e367d2m7daad032573eadac@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <9ac0c6aa0903301508x4e367d2m7daad032573eadac@mail.gmail.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
From: Marvin Humphrey <marvin@rectangular.com>

On Mon, Mar 30, 2009 at 06:08:28PM -0400, Michael McCandless wrote:

> >  * The VTable_registry Hash which is used to associate VTables with class
> >    names is stateful.
> 
> What is this hash used for?  Does it map name -> VTable?  

Yes.  It is used to support dynamic subclassing, and also to support
deserialization (when the appropriate deserialization function must be selected
based on the class name).
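
To illustrate the name -> VTable mapping, here is a minimal sketch.  It is not
the actual registry code: the VTable struct is pared down to two fields, the
"deserialize" hook, registry_add() and registry_fetch() are hypothetical names,
and a small fixed array stands in for the real Hash.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, pared-down VTable; the real struct carries much more. */
typedef struct VTable {
    const char *name;                          /* class name, e.g. "MyScorer" */
    void *(*deserialize)(const char *bytes);   /* hypothetical hook */
} VTable;

/* A fixed array stands in for the VTable_registry Hash in this sketch. */
#define REGISTRY_CAP 16
static VTable *registry[REGISTRY_CAP];
static size_t  registry_size = 0;

/* Associate a VTable with its class name.  Returns 0, or -1 if full. */
int
registry_add(VTable *vtable) {
    if (registry_size >= REGISTRY_CAP) { return -1; }
    registry[registry_size++] = vtable;
    return 0;
}

/* Map a class name to its VTable, or NULL if no such class is known. */
VTable*
registry_fetch(const char *class_name) {
    for (size_t i = 0; i < registry_size; i++) {
        if (strcmp(registry[i]->name, class_name) == 0) {
            return registry[i];
        }
    }
    return NULL;
}
```

A deserializer would read the class name from the stream, call something like
registry_fetch(), and dispatch through the VTable it gets back.  Because the
registry is mutated at runtime, it is shared state that threads would have to
coordinate on.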

> It seems like most references within core could be done by the
> compiler/linker?

I think you could say that most references within core rely on referring to
the VTable structs generated at compile-time by Boilerplater.

                     vvvv 
  if (!OBJ_IS_A(obj, HASH)) { 
      THROW("That's not a Hash, it's a %o", Obj_Get_Class_Name(obj));
  }

>  And then when host needs to refer to an object inside the core,
> shouldn't the bindings be exposed however the host would do it (eg as
> a Python module(s) with bindings), also compiled/linked statically.

I don't think I follow that...  Are you saying that dynamic subclassing would
never be needed under Python? 

> > The VTables themselves are stateful because they are refcounted.
> > Furthermore, dynamically created VTables are reclaimed once the last object
> > which needs them goes away -- so the "MyScorer" VTable at one point in the
> > program might not be the same as the "MyScorer" VTable at another point in
> > the program.
> 
> This is neat: so you can make a new subclass at runtime, instantiate a
> bunch of objects off of it, and once all objects are gone, and nobody
> directly refers to the subclass (vtable), it is reclaimed.  Can you define
> new vtables from the host language?

Yes, transparently, in Perl at least.  Standard Perl subclassing techniques
work.

   package MyObj;
   use base qw( Lucy::Obj );

   package main;

   my $obj = MyObj->new; # MyObj vtable created via inherited constructor.

   undef $obj;           # Triggers DESTROY. Last ref to dynamic VTable for 
                         # MyObj disappears, so it gets reclaimed.

At least that's how things work now.  To make things thread-safe, we'll create
the VTable for "MyObj" dynamically, but once created it will never go away.
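
A rough sketch of what "never go away" could look like in C, assuming a
pared-down VTable struct and hypothetical function names
(VTable_inc_refcount, VTable_dec_refcount); the real method signatures may
differ:

```c
/* Hypothetical, pared-down VTable for illustration only. */
typedef struct VTable {
    unsigned    refcount;   /* kept in the struct, but never modified */
    const char *name;
} VTable;

/* No-op: the VTable is immortal, so there is no count to adjust and no
 * write for threads to race on. */
VTable*
VTable_inc_refcount(VTable *self) {
    return self;
}

/* No-op: reports the untouched count and never triggers destruction. */
unsigned
VTable_dec_refcount(VTable *self) {
    return self->refcount;
}
```

Since neither function writes to the struct, any number of threads can hold
and release references to the same VTable without synchronization.  The cost
is that each dynamically created VTable is leaked.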

> > However, if we stop refcounting VTables (by making Inc_RefCount and
> > Dec_RefCount no-ops) and accept that dynamically created VTables will leak,
> > then those concerns go away.  I think this would be a reasonable approach.
> > The leak ain't gonna matter unless somebody does something psycho like cycle
> > through lots of unique class names.
> 
> That seems tentatively OK.  What do you see dynamic vtables being used
> for in Lucy?

They are required for subclassing in Perl, at least.

> So then we don't have to worry about individual objects being thread
> safe as long as we can ensure threads never share objects.

Yes. :)  [Pending resolution of issues around remaining shared globals.]

> It might get weird if the host also doesn't maintain separate worlds,
> eg the host can think it created object X in Lucy but then if it
> checks back later and gets the wrong world, object X is gone.

I don't understand how this could happen, so perhaps I'm not following.

> > Then there's search-time, where the multi-world approach isn't always
> > adequate.  Here, things get tricky because we have to worry about the
> > statefulness of objects that could be operating in multiple search threads.
> >
> > I think we'll be generally OK if we do most of our scoring at the segment
> > level.  Individual SegReaders don't share much within a larger PolyReader, so
> > Scorers derived from those SegReaders won't either.
> 
> OK even for concurrency within a single search, it sounds like?

All of the core Scorers can be made thread-safe for single-search concurrency.
We just need to avoid using stateful objects which are shared by multiple
SegReaders.  The list of those is reasonably small.

Individual DataReaders (PostingsReader, LexReader, DocReader, DeletionsReader,
etc) might or might not be stateful, but not in ways that are shared (other
than refcounting).  

For instance, consider a PostingsReader.  It holds open at least one InStream
which is stateful on 32-bit systems because of its sliding window.  It would
not be safe to share that InStream across multiple threads -- but there's no
reason that InStream would ever be shared across threads if we are scoring
per-segment and, for that matter, cloning the InStream whenever we spawn a
PostingList object.
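
As a sketch of why cloning sidesteps the problem, assume a drastically
simplified InStream that reads from an in-memory buffer.  The struct layout
and the function names InStream_clone() and InStream_read_byte() are
assumptions for illustration; the real InStream wraps a file and a sliding
window.

```c
#include <stdlib.h>

/* Hypothetical, simplified InStream. */
typedef struct InStream {
    const char *source;   /* backing data: shared, read-only */
    size_t      len;      /* length of the backing data */
    size_t      pos;      /* read position: private, mutable state */
} InStream;

/* Give the caller its own stream over the same backing data.  The twin
 * starts at the original's position but advances independently. */
InStream*
InStream_clone(const InStream *self) {
    InStream *twin = malloc(sizeof(InStream));
    if (twin == NULL) { return NULL; }
    *twin = *self;
    return twin;
}

/* Read one byte and advance, or return -1 at end of data. */
int
InStream_read_byte(InStream *self) {
    if (self->pos >= self->len) { return -1; }
    return (unsigned char)self->source[self->pos++];
}
```

Each PostingList would own the clone it was spawned with, so the only thing
two threads have in common is the read-only backing data.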

That PostingsReader also holds a reference to a shared Schema instance, which
may have Analyzers that aren't thread safe.  For that matter, Schema itself
technically isn't thread safe, because you might call Schema_Spec_Field() at
any moment.  However, just because you *can* use those shared elements in a
non-thread-safe way doesn't mean there's a reason you would do so.  No core
Scorer would, at least.

> > Then there's refcounting.  For most objects operating in a per-segment scoring
> > thread, it wouldn't be necessary to synchronize on Inc_RefCount and
> > Dec_RefCount, but I don't see how we'd be able to pick and choose our sync
> > points.
> 
> It sounds like we either 1) try to guarantee threads won't share
> objects, in which case you don't need to lock in the object nor in
> incref/decref, 

... good enough for indexing...

> or 2) allow for the possibility of sharing.  

... required for searching.

> You could use atomic increment/decrement, though I'm unsure how costly
> those really are at the CPU level.

For C under pthreads, where the refcount will presumably be a plain old
integer, that's probably what we want.
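
A minimal sketch of atomic refcounting, assuming a C11 compiler with
<stdatomic.h> (older GCC versions offer the __sync_fetch_and_add family of
builtins for the same purpose).  The Obj struct and function names here are
hypothetical stand-ins, not the actual object layout.

```c
#include <stdatomic.h>

/* Hypothetical object header holding only the refcount. */
typedef struct Obj {
    atomic_uint refcount;
} Obj;

/* A new object starts with one reference, held by its creator. */
void
Obj_init(Obj *self) {
    atomic_init(&self->refcount, 1);
}

/* Atomically add a reference; safe to call from several threads. */
Obj*
Obj_inc_refcount(Obj *self) {
    atomic_fetch_add(&self->refcount, 1);
    return self;
}

/* Atomically drop a reference and return the new count.  Exactly one
 * caller sees zero, and that caller is responsible for destroying the
 * object. */
unsigned
Obj_dec_refcount(Obj *self) {
    return atomic_fetch_sub(&self->refcount, 1) - 1;
}
```

This only makes the count itself safe.  It does nothing for the rest of an
object's state, which still has to be either unshared or read-only.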

Marvin Humphrey