incubator-lucy-dev mailing list archives

From "Marvin Humphrey (JIRA)" <>
Subject [jira] Commented: (LUCY-5) Boilerplater compiler
Date Fri, 13 Mar 2009 17:34:00 GMT


Marvin Humphrey commented on LUCY-5:

> I know there are issues with it, but... have you considered simply using
> C++, which has already created OO over C (vtables, etc.)? Or are there
> hopeless problems with its approach for Lucy?

Yes, we considered it.  First, C++ would severely constrain core library
development because of the vtable ABI issue -- holding Lucy to the rules in
that KDE binary-compatibility document would be unacceptable.  Second, we
would not be able to achieve the same level of integration between the
bindings and the core, because the C++ standard does not specify how dynamic
dispatch must be implemented.

> Another question: it seems like you are going to great lengths to achieve
> "no recompilation back compatibility".  Meaning, e.g., if someone has built
> Python bindings to version X of Lucy, and you've made some
> otherwise-back-compatible changes to the exposed API and release version
> X+1, you'd like for those Python bindings to continue to work w/o
> recompilation when someone drops in Lucy X+1 (as a dynamic library), right?

No, that's not the use case we're concerned with.

The idea is that if you install an independent third-party compiled extension
like "LucyX::RTree", it should still work after you upgrade the Lucy core.  

Using Perl/CPAN as an example, consider the following sequence of events:

  1. Install Lucy 1.00 via CPAN.
  2. Install LucyX::RTree via CPAN.
  3. Upgrade Lucy to version 1.01 via CPAN.

If we do not preserve Lucy's binary compatibility from version 1.00 to 1.01,
apps which use LucyX::RTree will suddenly start crashing hard immediately
after the upgrade finishes.  That's not acceptable.

> Couldn't you require that the bindings are rebuilt & recompiled when Lucy
> X+1 is released?

Yes, I think that's a reasonable requirement.  

Actually, I don't think we're going to be able to use the same shared object
with multiple bindings.  Python will need its own, C will need its own, Perl
will need its own, etc. -- and therefore, there will never be a normal case
where the bindings and the core library are out of sync.

The reason the core cannot be shared is that each binding has to implement
some functions which are declared by the core but left unimplemented -- for
example, the functions which implement callbacks to the host.  The object code
from those implementations will end up in the shared object.
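
For illustration -- the signature below is invented; the real one is whatever
the Native interface ends up declaring:

    /* Declared in a core-generated header, but never defined by the core: */
    int lucy_Native_callback_i(void *host_obj, const char *method, int arg);

    /* Each binding -- Perl, Python, C, etc. -- supplies its own definition,
       so that object code necessarily lands in a binding-specific shared
       object rather than in one core library shared by every host. */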

> But if added methods always went to the end of the vtable, wouldn't things
> work fine, as long as you had bounds checking so that if new code tried to
> look up a new method on old compiled code it would see it's not there?

That won't work.  Say that we have a core class "Dog" with two methods, bark()
and bite(), and an externally compiled subclass "Boxer" which overrides bark()
and adds drool().

Dog_vtable = {
    Dog_bark,
    Dog_bite
};

Boxer_vtable = {
    Boxer_bark,     /* override */
    Dog_bite,       /* inherited */
    Boxer_drool     /* added by the subclass */
};
Now say that we add eat(Food *food) to the base class Dog:

Dog_vtable = {
    Dog_bark,
    Dog_bite,
    Dog_eat         /* new method, appended at the end */
};
Unfortunately, the externally compiled Boxer_vtable has a fixed layout, and it
puts Boxer_drool in the slot where the core expects to find eat().  When the
core tries to call eat() on a Boxer object, chaos will ensue.
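
To spell out the failure mode, here's roughly what the core's dispatch would
do -- the layout and names are purely illustrative:

    typedef struct Dog  Dog;
    typedef struct Food Food;
    typedef void (*generic_method_t)(void);
    typedef void (*Dog_eat_t)(Dog *self, Food *food);

    struct Dog {
        generic_method_t *vtable;  /* first member points at the class vtable */
    };

    void
    feed(Dog *dog, Food *food) {
        /* The recompiled core believes slot 2 of any Dog vtable holds eat(),
           but the stale Boxer_vtable still has Boxer_drool there. */
        Dog_eat_t eat = (Dog_eat_t)dog->vtable[2];
        eat(dog, food);   /* on a Boxer, this silently calls drool() */
    }

Bounds checking doesn't help, either: Boxer_vtable has three entries, so slot
2 is "in bounds" -- it just holds the wrong function.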

>> Here's the method-invocation wrapper for Scorer_Next.

> This seems like a fair amount of overhead per-invocation. 

Not true.  :)  The C code is verbose, but the assembler is compact and the
machine instructions are cheap.  From

    ... I can detect no impact on performance using the indexing  
    benchmark script, even after changing InStream and OutStream from  
    FINAL_CLASS to CLASS so that their methods go through the dispatch  
    table rather than resolve to function addresses.  I speculate that  
    because all the extra instructions are pipeline-able, they're nearly  
    indistinguishable from free.

Double-dereference vtables are a standard technique for implementing dynamic
dispatch in C++, Java, etc.  The only thing we're doing differently is loading
the offset from a variable.
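
To make that concrete, here's roughly what such a wrapper might look like --
names and object layout are simplified for illustration, not the actual
Boilerplater output:

    #include <stddef.h>

    typedef struct lucy_Scorer lucy_Scorer;
    typedef int (*lucy_Scorer_next_t)(lucy_Scorer *self);

    struct lucy_Scorer {
        void **vtable;        /* every object points at its class's vtable */
        /* ... instance data ... */
    };

    /* The slot offset is a variable exported by the core, not a baked-in
       compile-time constant, so a recompiled core can rearrange or grow its
       vtables without breaking already-compiled extensions. */
    extern size_t lucy_Scorer_next_OFFSET;

    static inline int
    Scorer_Next(lucy_Scorer *self) {
        char *vt = (char*)self->vtable;                             /* deref #1 */
        lucy_Scorer_next_t method
            = *(lucy_Scorer_next_t*)(vt + lucy_Scorer_next_OFFSET); /* deref #2 */
        return method(self);                                        /* indirect call */
    }

The offset load, the two dereferences, and the indirect call are the whole
per-invocation overhead -- which is what the benchmark quoted above suggests
is essentially free.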

The "inside-out" aspect of using individual variables to hold the offsets was
inspired by the "inside-out object" technique drawn from Perl culture.
However, the idea of using variable vtable offsets has been studied before,
and is actually implemented in GCJ.  

See "Supporting Binary Compatibility with Static Compilation" by Dachuan Yu,
Zhong Shao, and Valery Trifonov, at

> Is it possible/OK for the caller to grab the next method up front and then
> invoke it itself?

Yes.  In fact, I don't think there's any harm in making that part of the
public API, because we're already committed by the ABI requirements.
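
For example, a hot loop could hoist the lookup -- reusing the invented names
from the sketch above:

    /* Fetch the method pointer once, outside the loop... */
    lucy_Scorer_next_t next
        = *(lucy_Scorer_next_t*)((char*)scorer->vtable + lucy_Scorer_next_OFFSET);

    /* ...then call it directly while iterating over hits. */
    while (next(scorer)) {
        /* process the current hit */
    }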

> Would "core" scorers be able to somehow bypass this lookup?


In addition... Since all vtable offsets are constant for a given core compile,
we could actually define our method invocation symbols differently if e.g.
LUCY_CORE is defined, avoiding the extra variable lookup.  
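
Something along these lines, sketched with a hypothetical macro and a made-up
constant:

    #ifdef LUCY_CORE
      /* Inside the core, the slot is a known compile-time constant... */
      #define LUCY_SCORER_NEXT_OFFSET ((size_t)(2 * sizeof(void*)))
    #else
      /* ...while extensions read it from the variable the core exports, so
         they keep working when a recompiled core moves the slot. */
      #define LUCY_SCORER_NEXT_OFFSET lucy_Scorer_next_OFFSET
    #endif

The wrapper sketched earlier would then dispatch through
LUCY_SCORER_NEXT_OFFSET, so core compiles get a constant-folded offset while
extensions keep the extra indirection.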

>> Each binding will have to implement lucy_Native_callback_i() and a few other
>> methods declared by Native.
> Native in this case means the dynamic language, right? I.e.,
> lucy_Native_callback_i would invoke my Python method for "next", when I've
> defined next in Python in my Matcher subclass?

Yes, that's the idea.  
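
A Python binding's definition might look roughly like this -- a sketch only,
reusing the invented signature from above and assuming the host object is
stored as a PyObject*:

    #include <Python.h>

    int
    lucy_Native_callback_i(void *host_obj, const char *method, int arg) {
        /* Re-dispatch to the method the user overrode in Python -- e.g.
           next() on a Matcher subclass. */
        PyObject *result = PyObject_CallMethod((PyObject*)host_obj,
                                               method, "i", arg);
        long retval = 0;
        if (result == NULL) {
            PyErr_Print();    /* sketch; real error handling would differ */
        }
        else {
            retval = PyLong_AsLong(result);
            Py_DECREF(result);
        }
        return (int)retval;
    }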

"Native" may not be the best name for that module, especially since it has
exactly the opposite meaning in Java. :)  How about "Host", instead?

> Boilerplater compiler
> ---------------------
>                 Key: LUCY-5
>                 URL:
>             Project: Lucy
>          Issue Type: New Feature
>          Components: Boilerplater
>            Reporter: Marvin Humphrey
>            Assignee: Marvin Humphrey
> Boilerplater is a small compiler which supports a vtable-based object model.
> The output is C code which adheres to the design that Dave Balmain and I
> hammered out a while back; the input is a collection of ".bp" header files.
> Our original intent was to pepper traditional C ".h" header files with no-op
> macros to define each class's interface; the code generator would understand
> these macros but the C compiler would ignore them.  C source code files would
> then pound-include both the ".h" header and the auxiliary, generated ".bp"
> file.
> The problem with this approach is that C syntax is too constraining.  Because
> C does not support namespacing, every symbol has to be prepended with a prefix
> to avoid conflicts.  Furthermore, adding metadata to declarations (such as
> default values for arguments, or whether NULL is an acceptable value) is
> awkward.  The result is ".h" header files that are excessively verbose,
> cumbersome to edit, and challenging to parse visually and to grok.
> The solution is to make the ".bp" file the master header file, and write it in
> a small, purpose-built, declaration-only language.  The
> code-generator/compiler chews this ".bp" file and spits out a single ".h"
> header file for pound-inclusion in ".c" source code files.
> This isn't really that great a divergence from the original plan.  There's no
> fixed point at which a "code generator" becomes a "compiler", and while the
> declaration-only header language has a few conventions that core developers
> will have to familiarize themselves with, the same was true for the no-op
> macro scheme.  Furthermore, the Boilerplater compiler itself is merely an
> implementation detail; it is not publicly exposed and thus can be modified at
> will.  Users who access Lucy via Perl, Ruby, Java, etc. will never see it.
> Even Lucy's C users will never see it, because the public C API itself will be
> defined by a lightweight binding and generated documentation.
> The important thing for us to focus on is the *output* code generated by
> Boilerplater.  We must nail the object model.  It has to be fast.  It has to
> live happily as a symbiote within each host.  It has to support callbacks into
> the host language, so that users may define custom subclasses and override
> methods easily.  It has to present a robust ABI that makes it possible to
> recompile an updated core without breaking compiled extensions (like Java,
> unlike C++).  
> The present implementation of the Boilerplater compiler is a collection of
> Perl modules: Boilerplater::Type, Boilerplater::Variable,
> Boilerplater::Method, Boilerplater::Class, and so on.  One CPAN module is
> required, Parse::RecDescent; however, only core developers will need either
> Perl or Parse::RecDescent, since public distributions of Lucy will 
> contain pre-generated code.  Some of Boilerplater's modules have kludgy 
> internals, but on the whole they seem to do a good job of throwing errors rather 
> than failing subtly.
> I expect to submit individual Boilerplater modules using JIRA sub-issues which
> reference this one, to allow room for adequate commentary.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
