Return-Path: Delivered-To: apmail-lucene-lucy-dev-archive@minotaur.apache.org Received: (qmail 24528 invoked from network); 25 Mar 2009 00:12:50 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 25 Mar 2009 00:12:50 -0000 Received: (qmail 28531 invoked by uid 500); 25 Mar 2009 00:12:50 -0000 Delivered-To: apmail-lucene-lucy-dev-archive@lucene.apache.org Received: (qmail 28470 invoked by uid 500); 25 Mar 2009 00:12:50 -0000 Mailing-List: contact lucy-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: lucy-dev@lucene.apache.org Delivered-To: mailing list lucy-dev@lucene.apache.org Received: (qmail 28460 invoked by uid 99); 25 Mar 2009 00:12:50 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 25 Mar 2009 00:12:50 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy) Received: from [68.116.38.202] (HELO rectangular.com) (68.116.38.202) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 25 Mar 2009 00:12:43 +0000 Received: from marvin by rectangular.com with local (Exim 4.63) (envelope-from ) id 1LmGjG-0005cD-Mp for lucy-dev@lucene.apache.org; Tue, 24 Mar 2009 17:12:22 -0700 Date: Tue, 24 Mar 2009 17:12:22 -0700 To: lucy-dev@lucene.apache.org Subject: Re: Reference counting inside a GC host (was "real time updates") Message-ID: <20090325001222.GA21515@rectangular.com> References: <20090315230128.GB23339@rectangular.com> <4B9436C9-4E76-4E11-9C55-0A15627B85FD@mikemccandless.com> <20090316004416.GC23339@rectangular.com> <9ac0c6aa0903160251j8a1451am64f51b704cf90b40@mail.gmail.com> <20090316172948.GA28202@rectangular.com> <9ac0c6aa0903161602icd8d9d8jcbf34b682481297@mail.gmail.com> <20090317003310.GA28694@rectangular.com> <9ac0c6aa0903170250t31e418f9pe6df3d072d9e4738@mail.gmail.com> <20090320003355.GA12256@rectangular.com> <9ac0c6aa0903201330j44db7a16wd0ddbca73e053da7@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <9ac0c6aa0903201330j44db7a16wd0ddbca73e053da7@mail.gmail.com> User-Agent: Mutt/1.5.13 (2006-08-11) From: Marvin Humphrey X-Virus-Checked: Checked by ClamAV on apache.org On Fri, Mar 20, 2009 at 04:30:21PM -0400, Michael McCandless wrote: > So... one alternative would be to separately track a private-to-Lucy > refCount from the host object's refCount? Since you refer to the "host object's refCount", I take it the following passage applies to refcounted systems like Perl and Python, and not to GC systems like Ruby and Java: > Then, for Lucy objects that > never cross the bridge you wouldn't have to make a "false" host > object. But you'd need to take care to destroy an object when both > Lucy's Obj & the host's wrapper obj drop to refCount 1. OK, if I understand you correctly, then there would be two refcounts. In essence, we intentionally create a circular reference between the Lucy object and the host wrapper object. However, we avoid the memory leak because whenever either of the two refcounts falls to one, we see whether the other refcount is also at 1 -- and if that's the case, we destroy the object. The problem I see here is that we don't necessarily have control over what happens during the host refcounting. In Perl at least, I don't know of a way to override what happens during SvREFCNT_dec, because it's not a method. So we can't set up a trigger that fires when the Perl object wrapper's refcount falls to 1 -- the only event we can count on is the call to $obj->DESTROY() when the perl reference count falls to 0. > This may also be better for non-refCount languages (Java). How would maintaining a refcount for a Java object work? Under a tracing GC system, I think you could say that the refcount is "true" until the object becomes unreachable, at which point it's "false". Other than the fact that the object is reachable, you don't know how many things are pointing at it. Because of that, there's no way to detect when the only reference left is the one from the Lucy object. > > However, that's not a major problem unless we're creating and > > destroying a boatload of small objects. ---->8 snip 8<---- > I think Lucene has improved here too, especially on the indexing side > (though the searching side doesn't create too many tiny objects I > think). Nothing like indexing. The only spot I can think of is the term dictionary index, with all those Term and TermInfo objects. In Java, you could address that by turning a 1000-object array of 5-element TermInfo objects into 5 primitive-type arrays with 1000 slots each. We want to do something like that anyway because of flexible indexing. In KS, we no longer generate a lot of Term/TermInfo objects because that data structure has been modified for use with mmap. However, we'll want to mod that system further for flexible indexing and PFOR. I look forward to taking that up in the future. :) > > literals, and such in a bootstrap routine, but it's enough of a pain > > to set something like that up that I haven't gone and made such a > > change in KS. > > > > I'd really like to kill of FastObj just for the sake of simplicity, > > though. > > What objects are planned to subclass FastObj? VTable, plus the CharBuf family. Right now in KS, the following classes also descend from FastObj: VArray Posting Token Inversion (nee TokenBatch) ByteBuf ViewByteBuf Undefined However, none of them have the declaration/bootstrapping problem presented by VTable and CharBuf. Additionally, I hope to eliminate Token (by expanding the capabilities of Inversion), Posting is due for an index-time overhaul but isn't a problem at search-time, Undefined isn't important, and ByteBuf and ViewByteBuf don't present speed problems. Rebasing VArray and Inversion (which is a subclass of VArray) to be children of Obj rather than FastObj causes a performance dip of about 3% on the indexing benchmark. I imagine that's as bad as it would be for any binding because Perl objects are pretty heavy. > > The Ferret scheme won't cause problems with light usage of the > > library, because most of Lucy's work will be done within tight loops > > in the C core. > > What about a HitCollector in the host language? Can you efficiently > re-use an object from the host language? (Python has such tricks, eg > to re-use a TupleObject during iteration). Oh, definitely a HitCollector would be a problem. Restating the original assertion: the Ferret scheme won't cause problems if you avoid host-language subclasses. > > Do we need to > > add all new objects to a giant Hash that we tell the host about, and > > yank C stack vars out of that Hash before the C function returns? > > What does Ruby's C-embedding API expose to interact with its GC? I > would imagine it'd be similar to Java's (ie "here's a new global root > object")? I understand you can do something like that. > I don't understand the C stack vars / hash question. Consider the following routine, set up for tracing garbage collection: void FSFolder_delete(FSFolder *self, const CharBuf *filename) { CharBuf *fullpath = (CharBuf*)GC_add( CB_newf("%o/%o", self->path, filename)); bool_t result = Host_callback_i(self, "do_delete", 1, ARG_STR("path", fullpath)); if (!result) { THROW("failed to delete %o", fullpath); } GC_remove(fullpath); } Assume that the CharBuf formatted constructor CB_newf() caches a Host obj within the CharBuf, and that no reference counting is performed because we're going to rely upon the host's tracing garbage collector. When "fullpath" is created at the top of our function, it's unreferenced as far as the Host's garbage collector is concerned. The only place it exists is on the C stack -- but the garbage collector doesn't walk the C stack looking for roots. It walks the host's stack, but not the C stack. As soon as fullpath is born, it is a candidate for garbage collection, and could be reclaimed at any moment. Therefore, we *have* to call GC_add() to inform the Host about our new CharBuf. If we *don't* call GC_add(), we could potentially get an invalid read error while composing the message for THROW. Say that Host_callback_i doesn't send "fullpath" itself into host-space, but instead copies its contents into a host string type. (This is actually what happens in the current KS implementation.) Now, say that a GC takes place during Host_callback_i(). Uh-oh, "fullpath" is toast -- it was unreferenced, and got reclaimed. If THROW needs it, we'll be reading from freed memory. Our only remedy is to call GC_add() early on, protecting fullpath until we know we don't need it any more. But calling GC_add() means that we have to call GC_remove() somewhere, or we'll leak memory. That call to GC_remove() at the end was what I meant by "yank C stack vars out of that Hash before the C function returns". Now, it turns out that the same function set up for a refcounting host looks pretty similar... eliminate the call to GC_add(), and change the GC_remove() to a DECREF(): void FSFolder_delete(FSFolder *self, const CharBuf *filename) { CharBuf *fullpath = CB_newf("%o/%o", self->path, filename); bool_t result = Host_callback_i(self, "do_delete", 1, ARG_STR("path", fullpath)); if (!result) { THROW("failed to delete %o", fullpath); } DECREF(fullpath); } Perhaps it's possible to somehow have the DECREF() op perform double duty and trigger GC_remove() on a garbage collecting host. I dunno, that seems likely to cause double-free problems, but I'll try thinking that through in a future post. Marvin Humphrey