incubator-lucy-dev mailing list archives

From Marvin Humphrey <>
Subject Re: Reference counting inside a GC host (was "real time updates")
Date Mon, 30 Mar 2009 01:20:03 GMT
On Sun, Mar 29, 2009 at 05:55:11AM -0400, Michael McCandless wrote:

> > FWIW, the cached-host-object algo wins the tight loop performance contest.
> Which makes sense, but how often are such loops [expected to be] used
> in practice?  EG we already thought Collector inside host would be
> doable but expected poor performance.

The performance hit will matter whenever someone overrides a tight-loop
method.  How much it matters will vary from host to host.

Probably any Scorer or Collector would feel it.  DataReader: it depends on
whether any Scorer implementations call on it in tight loops.

InStream and OutStream: they definitely would, except that they're final
classes, so you can't override any methods.

Analyzer: maybe not so much because the analysis method would be non-trivial.
(Assuming that we settle on something like the KS "Inversion"-style Analyzers
rather than the method-call-orgy of Lucene TokenStream.)

Searcher, Indexer, Query, QueryParser, DataWriter, IndexManager, Lock: no,
because they don't invoke lightweight methods within tight loops.  FieldSpec,
Schema: I don't think so, because so far it's been possible to cache the
return values of certain methods which would otherwise be called often.
(KinoSearch::Index::Inverter performs such caching.)
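The Inverter-style caching described above amounts to hoisting an overridable
method's return value out of the per-document loop.  A minimal sketch, with
hypothetical names rather than the real KinoSearch API:

```c
#include <assert.h>
#include <stddef.h>

/* A Schema-like vtable with one overridable accessor (hypothetical names). */
typedef struct Schema Schema;
struct Schema {
    int (*binary)(Schema *self, const char *field);
};

/* Counts virtual dispatches so the benefit of caching is visible. */
static int dispatches = 0;

static int
base_binary(Schema *self, const char *field) {
    (void)self; (void)field;
    dispatches++;
    return 0;
}

/* Inverter-style caching: call the overridable method once per field,
 * then reuse the cached return value for every document. */
static int
invert_documents(Schema *schema, const char *field, int num_docs) {
    int binary = schema->binary(schema, field); /* one dispatch, cached */
    int inverted = 0;
    for (int i = 0; i < num_docs; i++) {
        if (!binary) { inverted++; }            /* per-doc work reuses the cache */
    }
    return inverted;
}
```

The point is that the potentially-overridden method gets dispatched once per
field, not once per document, so a slow host-side override costs almost nothing.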

> How much work is being done when you create a new host-wrapper object?

That will vary from Host to Host.  In Perl, we're using blessed scalars, which
are about as cheap as Perl objects get.

However, Perl method dispatch is comparatively slow; if method calls were
faster, the effect would loom larger.  In Java, for instance, method calls are
very fast, and creating a host wrapper object on each loop iteration would
probably cause significant degradation much sooner.

Put another way: caching a host wrapper object probably matters a lot more
when method dispatch is fast.
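To make the trade-off concrete, here's a sketch of the two looping strategies.
The types and names are hypothetical stand-ins, not the real Lucy API:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a Lucy object and its host wrapper. */
typedef struct { int refcount; } LucyObj;
typedef struct { LucyObj *inner; } HostWrapper;

static HostWrapper*
make_wrapper(LucyObj *obj) {
    HostWrapper *w = (HostWrapper*)malloc(sizeof(HostWrapper));
    w->inner = obj;
    return w;
}

/* Naive strategy: build and tear down a wrapper on every iteration.
 * Returns the number of wrappers created. */
static int
loop_fresh_wrapper(LucyObj *obj, int iters) {
    int created = 0;
    for (int i = 0; i < iters; i++) {
        HostWrapper *w = make_wrapper(obj);
        created++;
        /* ... Host_callback(w, "collect", ...) would go here ... */
        free(w);
    }
    return created;
}

/* Cached strategy: hoist wrapper creation out of the loop. */
static int
loop_cached_wrapper(LucyObj *obj, int iters) {
    HostWrapper *w = make_wrapper(obj);
    int created = 1;
    for (int i = 0; i < iters; i++) {
        /* ... Host_callback(w, "collect", ...) would go here ... */
    }
    free(w);
    return created;
}
```

With cheap allocation and slow dispatch (Perl), the per-iteration malloc hides
behind the method call; with fast dispatch (Java), it dominates.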

> Could you somehow re-use such objects?  (Sort of like how when you go
> to a children's museum, to the water section, they provide a bunch of
> yellow frocks that the kids can wear when they play with the water,
> and return when they are done).  

Evocative metaphor. :)  So, who manages the pool?  

Ideally, when we know we're going to be calling back to the host in a tight
loop, we'd want the caller to cache the host object.  But that's not easy.

Here's the basic implementation of Matcher_collect():

  Matcher_collect(Matcher *self, Collector *collector)
  {
      i32_t doc_num;
      Collector_Set_Matcher(collector, self);
      while (0 != (doc_num = Matcher_Next(self))) {
          Coll_Collect(collector, doc_num); /* may dispatch to the host */
      }
  }

Say that the user has overridden Collector's Collect() method in the host, so
it's slow.  To get best performance, we need to change how *Matcher* works,
even though it's *Collector* that's the source of the slowdown.

  Matcher_collect(Matcher *self, Collector *collector)
  {
      i32_t  doc_num;
      void  *host_obj = Coll_To_Host(collector); /* <--- Cache wrapper here */

      Collector_Set_Matcher(collector, self);
      while (0 != (doc_num = Matcher_Next(self))) {
          Host_callback(host_obj, "collect", 1, ARG_I32("doc_num", doc_num));
      }

      Coll_Destroy_Host(collector, host_obj); /* <--- Destroy wrapper here */
  }

(Host_callback() ordinarily takes a Lucy object as its first argument, but
pretend that it takes a host wrapper object for now.)

Coll_Collect doesn't know it's being called in a loop.  Even if it did, it
could only cache host object wrappers using static or global vars. (Bye-bye
thread safety, and what on earth do we do about subclasses?)

  static void *Coll_collect_HOST_OBJ = NULL;

  Coll_collect_OVERRIDE(Collector *self, i32_t doc_num)
  {
      /* Bleah. */
      Coll_collect_HOST_OBJ = Coll_Validate_Or_Make_Wrapper(self,
          Coll_collect_HOST_OBJ);
      Host_callback(Coll_collect_HOST_OBJ, "collect", 1,
          ARG_I32("doc_num", doc_num));
  }

> > We lose convenient use of the "inside-out-object" pattern. :(
> Neat.... I've seen a similar approach (though likely someone has given
> / would give it a different named pattern) 

The "FlyWeight" design pattern is very similar.
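The Flyweight idea maps onto this problem as a cache of shared wrapper
instances: asking twice for a wrapper around the same underlying object
returns the same instance.  A deliberately tiny, hypothetical sketch:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal flyweight-style cache (illustrative only, not the Lucy API):
 * requests for a wrapper around the same underlying object get back a
 * shared instance instead of a fresh allocation. */
typedef struct { int refcount; } LucyObj;
typedef struct { LucyObj *inner; } HostWrapper;

#define CACHE_SLOTS 8
static HostWrapper *flyweight_cache[CACHE_SLOTS];

static HostWrapper*
get_wrapper(LucyObj *obj) {
    size_t slot = ((size_t)obj >> 4) % CACHE_SLOTS;
    HostWrapper *w = flyweight_cache[slot];
    if (w != NULL && w->inner == obj) {
        return w;                      /* flyweight hit: share the instance */
    }
    w = (HostWrapper*)malloc(sizeof(HostWrapper));
    w->inner = obj;
    flyweight_cache[slot] = w;         /* sketch only: evicted entries leak */
    return w;
}
```

A real version would have to solve the same problems as the static-var hack
above: eviction, thread safety, and host-side subclasses.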

> In your example, shouldn't/couldn't the hash key somehow be the id of
> the *Lucy* object, not of its fleeting host wrapper obj?  

Actually, that's what $$self is.

   $self          <---- address of Perl object[1]
   $$self         <---- address of Lucy object

However, invoking $self->DESTROY on multiple wrappers around the same Lucy
object is still a problem.

  sub DESTROY {
      my $self = shift;
      delete $foo{$$self};
      $self->SUPER::DESTROY;  # Calls Dec_RefCount(), not Destroy() as it
                              # would under the cached-object model.
  }

As soon as the wrapper's refcount falls to 0, Perl invokes DESTROY, and even
though the Lucy object doesn't get zapped yet, $foo{$$self} gets deleted.
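The failure mode can be boiled down to a few lines of C.  Everything here is a
hypothetical stand-in (a one-slot "registry" plays the role of %foo), but the
sequence of events matches the Perl case:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the premature-delete problem: an inside-out registry keyed
 * by the *Lucy* object's address, with two host wrappers around the
 * same object. */
typedef struct { int refcount; } LucyObj;
typedef struct { LucyObj *inner; } HostWrapper;

/* One-slot "registry" standing in for the %foo hash. */
static LucyObj *registry_key = NULL;
static int      registry_val = 0;

static void registry_set(LucyObj *k, int v) { registry_key = k; registry_val = v; }
static int  registry_has(LucyObj *k)        { return registry_key == k; }
static void registry_del(LucyObj *k)        { if (registry_key == k) registry_key = NULL; }

/* Analog of DESTROY under the refcount model: it deletes the entry keyed
 * by the Lucy address even though the Lucy object may still be alive. */
static void
wrapper_destroy(HostWrapper *w) {
    registry_del(w->inner);  /* delete $foo{$$self} */
    w->inner->refcount--;    /* Dec_RefCount(), not Destroy() */
}
```

Destroying the first of two wrappers leaves the Lucy object alive but its
inside-out data gone, which is exactly the bug described above.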

Marvin Humphrey

[1] Slight oversimplification.
