Mailing-List: contact lucy-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: lucy-dev@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Date: Tue, 24 Mar 2009 17:12:22 -0700
To: lucy-dev@lucene.apache.org
Subject: Re: Reference counting inside a GC host (was "real time updates")
Message-ID: <20090325001222.GA21515@rectangular.com>
References: <20090315230128.GB23339@rectangular.com>
 <4B9436C9-4E76-4E11-9C55-0A15627B85FD@mikemccandless.com>
 <20090316004416.GC23339@rectangular.com>
 <9ac0c6aa0903160251j8a1451am64f51b704cf90b40@mail.gmail.com>
 <20090316172948.GA28202@rectangular.com>
 <9ac0c6aa0903161602icd8d9d8jcbf34b682481297@mail.gmail.com>
 <20090317003310.GA28694@rectangular.com>
 <9ac0c6aa0903170250t31e418f9pe6df3d072d9e4738@mail.gmail.com>
 <20090320003355.GA12256@rectangular.com>
 <9ac0c6aa0903201330j44db7a16wd0ddbca73e053da7@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <9ac0c6aa0903201330j44db7a16wd0ddbca73e053da7@mail.gmail.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
From: Marvin Humphrey <marvin@rectangular.com>

On Fri, Mar 20, 2009 at 04:30:21PM -0400, Michael McCandless wrote:

> So... one alternative would be to separately track a private-to-Lucy
> refCount from the host object's refCount?  

Since you refer to the "host object's refCount", I take it the following
passage applies to refcounted systems like Perl and Python, and not to GC
systems like Ruby and Java:

> Then, for Lucy objects that
> never cross the bridge you wouldn't have to make a "false" host
> object.  But you'd need to take care to destroy an object when both
> Lucy's Obj & the host's wrapper obj drop to refCount 1.

OK, if I understand you correctly, then there would be two refcounts.
In essence, we intentionally create a circular reference between the Lucy
object and the host wrapper object.  However, we avoid the memory leak because
whenever either of the two refcounts falls to one, we see whether the other
refcount is also at 1 -- and if that's the case, we destroy the object.

The problem I see here is that we don't necessarily have control over what
happens during the host refcounting.  In Perl at least, I don't know of a way
to override what happens during SvREFCNT_dec, because it's not a method.  So
we can't set up a trigger that fires when the Perl object wrapper's refcount
falls to 1 -- the only event we can count on is the call to $obj->DESTROY()
when the perl reference count falls to 0.

> This may also be better for non-refCount languages (Java).

How would maintaining a refcount for a Java object work?  Under a tracing GC
system, I think you could say that the refcount is "true" until the object
becomes unreachable, at which point it's "false".  Other than the fact that
the object is reachable, you don't know how many things are pointing at it.
Because of that, there's no way to detect when the only reference left is the
one from the Lucy object.

> > However, that's not a major problem unless we're creating and
> > destroying a boatload of small objects.

---->8 snip 8<----

> I think Lucene has improved here too, especially on the indexing side
> (though the searching side doesn't create too many tiny objects I
> think).

Nothing like indexing.  The only spot I can think of is the term dictionary
index, with all those Term and TermInfo objects.  

In Java, you could address that by turning a 1000-object array of 5-element
TermInfo objects into 5 primitive-type arrays with 1000 slots each.  We want
to do something like that anyway because of flexible indexing.

In KS, we no longer generate a lot of Term/TermInfo objects because that data
structure has been modified for use with mmap.  However, we'll want to mod
that system further for flexible indexing and PFOR.  

I look forward to taking that up in the future.  :)

> > literals, and such in a bootstrap routine, but it's enough of a pain
> > to set something like that up that I haven't gone and made such a
> > change in KS.
> >
> > I'd really like to kill of FastObj just for the sake of simplicity,
> > though.
> 
> What objects are planned to subclass FastObj?

VTable, plus the CharBuf family.  

Right now in KS, the following classes also descend from FastObj:

    VArray 
    Posting
    Token
    Inversion (nee TokenBatch)
    ByteBuf
    ViewByteBuf
    Undefined 
    
However, none of them have the declaration/bootstrapping problem presented by
VTable and CharBuf.  Additionally, I hope to eliminate Token (by expanding the
capabilities of Inversion), Posting is due for an index-time overhaul but
isn't a problem at search-time, Undefined isn't important, and ByteBuf and
ViewByteBuf don't present speed problems.

Rebasing VArray and Inversion (which is a subclass of VArray) to be children
of Obj rather than FastObj causes a performance dip of about 3% on the
indexing benchmark.  I imagine that's as bad as it would be for any binding
because Perl objects are pretty heavy.

> > The Ferret scheme won't cause problems with light usage of the
> > library, because most of Lucy's work will be done within tight loops
> > in the C core.
> 
> What about a HitCollector in the host language?  Can you efficiently
> re-use an object from the host language?  (Python has such tricks, eg
> to re-use a TupleObject during iteration).

Oh, definitely a HitCollector would be a problem.

Restating the original assertion: the Ferret scheme won't cause problems if
you avoid host-language subclasses.

> > Do we need to
> > add all new objects to a giant Hash that we tell the host about, and
> > yank C stack vars out of that Hash before the C function returns?
>
> What does Ruby's C-embedding API expose to interact with its GC?  I
> would imagine it'd be similar to Java's (ie "here's a new global root
> object")?

I understand you can do something like that.

> I don't understand the C stack vars / hash question.

Consider the following routine, set up for tracing garbage collection:

    void
    FSFolder_delete(FSFolder *self, const CharBuf *filename)
    {
        CharBuf *fullpath = (CharBuf*)GC_add(
            CB_newf("%o/%o", self->path, filename));
        bool_t result = Host_callback_i(self, "do_delete", 1, 
            ARG_STR("path", fullpath));
        if (!result) {
            THROW("failed to delete %o", fullpath);
        }
        GC_remove(fullpath);
    }

Assume that the CharBuf formatted constructor CB_newf() caches a Host obj
within the CharBuf, and that no reference counting is performed because we're
going to rely upon the host's tracing garbage collector.  

When "fullpath" is created at the top of our function, it's unreferenced as
far as the Host's garbage collector is concerned.  The only place it exists is
on the C stack -- but the garbage collector doesn't walk the C stack looking
for roots.  It walks the host's stack, but not the C stack.  
    
As soon as fullpath is born, it is a candidate for garbage collection, and
could be reclaimed at any moment.  Therefore, we *have* to call GC_add() to
inform the Host about our new CharBuf.

If we *don't* call GC_add(), we could potentially get an invalid read error
while composing the message for THROW.  Say that Host_callback_i doesn't send
"fullpath" itself into host-space, but instead copies its contents into a host
string type.  (This is actually what happens in the current KS implementation.)
Now, say that a GC takes place during Host_callback_i().  Uh-oh, "fullpath" is
toast -- it was unreferenced, and got reclaimed.  If THROW needs it, we'll be
reading from freed memory.

Our only remedy is to call GC_add() early on, protecting fullpath until we
know we don't need it any more.  But calling GC_add() means that we have to
call GC_remove() somewhere, or we'll leak memory.

That call to GC_remove() at the end was what I meant by "yank C stack vars out
of that Hash before the C function returns".

Now, it turns out that the same function set up for a refcounting host looks
pretty similar... eliminate the call to GC_add(), and change the GC_remove() to
a DECREF():

    void
    FSFolder_delete(FSFolder *self, const CharBuf *filename)
    {
        CharBuf *fullpath = CB_newf("%o/%o", self->path, filename);
        bool_t result = Host_callback_i(self, "do_delete", 1, 
            ARG_STR("path", fullpath));
        if (!result) {
            THROW("failed to delete %o", fullpath);
        }
        DECREF(fullpath);
    }

Perhaps it's possible to somehow have the DECREF() op perform double duty and
trigger GC_remove() on a garbage collecting host.  I dunno, that seems likely
to cause double-free problems, but I'll try thinking that through in a future
post.

Marvin Humphrey