incubator-lucy-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCY-5) Boilerplater compiler
Date Sat, 14 Mar 2009 09:25:50 GMT


Michael McCandless commented on LUCY-5:

"Native" may not be the best name for that module, especially since it has
exactly the opposite meaning in Java.  How about "Host", instead?

I like Host!

The idea is that if you install an independent third-party compiled extension
like "LucyX::RTree", it should still work after you upgrade the Lucy

I guess the assumption is if I install N packages that use Lucy core
for a given Host language X (eg Python) and presumably given major
version of Lucy core, they will centrally reference the core shared
lib (vs compiling statically, or using a "private" shared lib)?

I can understand why KDE needs to use shared libs -- a zillion
installed apps link against those libs.  For Lucy, it's less clear
this should be a requirement, though it's certainly nice.

Actually, I don't think we're going to be able to use the same shared object
with multiple bindings. Python will need its own, C will need its own, Perl
will need its own, etc - and therefore, there will never be a normal case
where the bindings and the core library are out of sync.

I see -- the compiled Lucy core shared lib will be host-language
specific.  So somewhere centrally (/usr/local/lib or something) I'd
have something like this:

(Where python had two major releases, 1 and 2, of Lucy installed).
Hmm actually you'd also have to separate out the version of the host
language each of these were built against.

These may symlink to the particular minor releases for each of those
that are currently installed.

That won't work. Say that we have a core class "Dog" with two methods, bark()
and bite(), and an externally compiled subclass "Boxer" which overrides bark()
and adds drool().
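
The rest of that explanation is not preserved in this message. As a rough sketch of the kind of layout problem being described, assume (hypothetically) that method slots are looked up through variables owned by the core library rather than compiled into the extension as constants. All names below (VTable, *_OFFSET, bootstrap_core, bootstrap_boxer, invoke) are made up for illustration and are not Lucy's generated code.

```c
#include <stddef.h>
#include <string.h>

typedef struct Dog Dog;
typedef const char *(*method_t)(Dog *self);

/* Hypothetical vtable: a method count plus a fixed-size slot array. */
typedef struct VTable {
    size_t   num_methods;
    method_t methods[8];
} VTable;

struct Dog { VTable *vtable; };

/* Slot positions are variables filled in at bootstrap time, so an
 * extension never bakes a slot number into its own object code. */
size_t Dog_bark_OFFSET;
size_t Dog_bite_OFFSET;
size_t Boxer_drool_OFFSET;

static const char *Dog_bark(Dog *self)    { (void)self; return "Woof";  }
static const char *Dog_bite(Dog *self)    { (void)self; return "Chomp"; }
static const char *Boxer_bark(Dog *self)  { (void)self; return "WOOF";  }
static const char *Boxer_drool(Dog *self) { (void)self; return "Drip";  }

VTable DOG, BOXER;

/* Core side: assign Dog's slots in declaration order. */
void bootstrap_core(void) {
    DOG.num_methods = 0;
    Dog_bark_OFFSET = DOG.num_methods++;
    Dog_bite_OFFSET = DOG.num_methods++;
    DOG.methods[Dog_bark_OFFSET] = Dog_bark;
    DOG.methods[Dog_bite_OFFSET] = Dog_bite;
}

/* Extension side: inherit Dog's slots, override bark(), append drool()
 * after however many methods the parent currently has. */
void bootstrap_boxer(void) {
    BOXER = DOG;
    BOXER.methods[Dog_bark_OFFSET] = Boxer_bark;
    Boxer_drool_OFFSET = BOXER.num_methods++;
    BOXER.methods[Boxer_drool_OFFSET] = Boxer_drool;
}

/* Dispatch through whatever slot the variable currently names. */
const char *invoke(Dog *self, size_t offset) {
    return self->vtable->methods[offset](self);
}
```

In this sketch, if a later core release added a third method to Dog, bootstrap_boxer() would simply place drool() one slot further along, with no recompile of the extension. Had the extension hard-coded "drool is slot 2", the new core method and drool() would collide.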

OK -- the binary compatibility challenge makes sense now -- thanks for
the tutorial (I should've gone and read that KDE doc the first time

> This seems like a fair amount of overhead per-invocation.

Not true.  The C code is verbose, but the assembler is compact and the
machine instructions are cheap.

Interesting and strange!  And unexpectedly pleasantly surprising...


> Would "core" scorers be able to somehow bypass this lookup?


In addition... Since all vtable offsets are constant for a given core compile,
we could actually define our method invocation symbols differently if e.g.
LUCY_CORE is defined, avoiding the extra variable lookup.
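
A minimal sketch of how that might look, assuming a hypothetical Scorer class with a single next() method. Only the LUCY_CORE symbol comes from the text above; the offset names, the generic_fn type and the Scorer layout are invented for illustration.

```c
#include <stddef.h>

typedef struct Scorer Scorer;
typedef void (*generic_fn)(void);        /* generic slot type */
typedef int  (*scorer_next_t)(Scorer *self);

struct Scorer {
    generic_fn *vtable;   /* array of method pointers */
    int         doc;
};

/* The core always exports the offset as a variable, for extensions. */
size_t Scorer_next_OFFSET = 0;

#ifdef LUCY_CORE
  /* Inside the core the layout is fixed at compile time, so the
   * offset can be a literal constant. */
  #define SCORER_NEXT_OFFSET ((size_t)0)
#else
  /* Outside the core, read the exported variable: one extra load. */
  #define SCORER_NEXT_OFFSET Scorer_next_OFFSET
#endif

/* Method invocation symbol: the same source text in both builds. */
static inline int Scorer_next(Scorer *self) {
    scorer_next_t method = (scorer_next_t)self->vtable[SCORER_NEXT_OFFSET];
    return method(self);
}

/* A trivial implementation, so the sketch has something to dispatch to. */
static int counting_next(Scorer *self) { return ++self->doc; }
static generic_fn SCORER_VTABLE[] = { (generic_fn)counting_next };
```

Compiling with -DLUCY_CORE picks the constant form, and compiling without it picks the variable form. Callers write Scorer_next(scorer) either way.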

Would the core default impls for classes like default MergePolicy,
MergeScheduler, IndexDeletionPolicy, HitCollector, Analyzer,
Tokenizer, TokenFilter, etc all be implemented in C?

OK so it sounds like calling functions/methods fit into various

  * Entirely Lucy internal --> just call the function directly, so
    normal C compilation handles this.

  * Lucy invokes "dynamically dispatched" API (ie API that could be
    implemented in the host language, eg when I subclass Analyzer,
    HitCollector, IndexDeletionPolicy, etc.), but in the current
    context we are using an object in C and so we bypass the dynamic
    dispatch.  This path remains fast?

  * Lucy invokes "dynamically dispatched" API, and in fact its impl in
    the current context is defined in the host language, so we go
    through the full dynamic dispatch.
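
To make the three cases concrete, here is a rough sketch using a hypothetical HitCollector with one collect() method. None of these names or layouts come from Lucy. The "host" side is faked with a plain C function pointer standing in for whatever callback mechanism a real binding would use. In this sketch the second case still goes through the vtable, but the slot points straight at a C function, so no host call is involved. Whether Lucy would skip the vtable entirely in that case is not something this message settles.

```c
#include <stddef.h>

typedef struct HitCollector HitCollector;
typedef void (*collect_t)(HitCollector *self, int doc_id);

typedef struct { collect_t collect; } HitCollectorVTable;

struct HitCollector {
    const HitCollectorVTable *vtable;
    int   count;      /* state used by the C implementation */
    void *host_obj;   /* opaque handle to a host-language object, if any */
};

/* Case 1: purely internal helper, called directly. */
static int clamp_doc_id(int doc_id) { return doc_id < 0 ? 0 : doc_id; }

/* Case 2: dispatchable method whose implementation is plain C. */
static void CountingCollector_collect(HitCollector *self, int doc_id) {
    (void)doc_id;
    self->count++;
}

/* Case 3: dispatchable method implemented in the host language.
 * host_call_collect is a stand-in for the binding's callback hook. */
typedef void (*host_callback_t)(void *host_obj, int doc_id);
static host_callback_t host_call_collect;

static void HostCollector_collect(HitCollector *self, int doc_id) {
    host_call_collect(self->host_obj, doc_id);
}

/* Fake "host" override for demonstration: sums doc ids into an int. */
static void fake_host_sum(void *host_obj, int doc_id) {
    *(int *)host_obj += doc_id;
}

static const HitCollectorVTable COUNTING_VT = { CountingCollector_collect };
static const HitCollectorVTable HOST_VT     = { HostCollector_collect };

/* Core code: identical call site whichever implementation is behind it. */
void feed_hits(HitCollector *hc, const int *docs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int doc = clamp_doc_id(docs[i]);   /* case 1: direct call */
        hc->vtable->collect(hc, doc);      /* case 2 or 3: via vtable */
    }
}
```

With a collector built on COUNTING_VT, feed_hits() never leaves C. With HOST_VT and host_call_collect set to fake_host_sum, every hit takes the extra hop through the callback.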

How easy will it be to subclass in the host language?  EG, for
PyLucene I have to make a 'stub' class in Java first:

I assume Lucy has a similar requirement, ie we must decide up front
which methods are "dynamically dispatchable" and ensure Lucy always
invokes those methods dynamically.

> Boilerplater compiler
> ---------------------
>                 Key: LUCY-5
>                 URL:
>             Project: Lucy
>          Issue Type: New Feature
>          Components: Boilerplater
>            Reporter: Marvin Humphrey
>            Assignee: Marvin Humphrey
>
> Boilerplater is a small compiler which supports a vtable-based object model.
> The output is C code which adheres to the design that Dave Balmain and I
> hammered out a while back; the input is a collection of ".bp" header files.
>
> Our original intent was to pepper traditional C ".h" header files with no-op
> macros to define each class's interface; the code generator would understand
> these macros but the C compiler would ignore them.  C source code files would
> then pound-include both the ".h" header and the auxiliary, generated ".bp"
> file.
>
> The problem with this approach is that C syntax is too constraining.  Because
> C does not support namespacing, every symbol has to be prepended with a prefix
> to avoid conflicts.  Furthermore, adding metadata to declarations (such as
> default values for arguments, or whether NULL is an acceptable value) is
> awkward.  The result is ".h" header files that are excessively verbose,
> cumbersome to edit, and challenging to parse visually and to grok.
>
> The solution is to make the ".bp" file the master header file, and write it in
> a small, purpose-built, declaration-only language.  The
> code-generator/compiler chews this ".bp" file and spits out a single ".h"
> header file for pound-inclusion in ".c" source code files.
>
> This isn't really that great a divergence from the original plan.  There's no
> fixed point at which a "code generator" becomes a "compiler", and while the
> declaration-only header language has a few conventions that core developers
> will have to familiarize themselves with, the same was true for the no-op
> macro scheme.  Furthermore, the Boilerplater compiler itself is merely an
> implementation detail; it is not publicly exposed and thus can be modified at
> will.  Users who access Lucy via Perl, Ruby, Java, etc will never see it.
> Even Lucy's C users will never see it, because the public C API itself will be
> defined by a lightweight binding and generated documentation.
>
> The important thing for us to focus on is the *output* code generated by
> Boilerplater.  We must nail the object model.  It has to be fast.  It has to
> live happily as a symbiote within each host.  It has to support callbacks into
> the host language, so that users may define custom subclasses and override
> methods easily.  It has to present a robust ABI that makes it possible to
> recompile an updated core without breaking compiled extensions (like Java,
> unlike C++).
>
> The present implementation of the Boilerplater compiler is a collection of
> Perl modules: Boilerplater::Type, Boilerplater::Variable,
> Boilerplater::Method, Boilerplater::Class, and so on.  One CPAN module is
> required, Parse::RecDescent; however, only core developers will need either
> Perl or Parse::RecDescent, since public distributions of Lucy will
> contain pre-generated code.  Some of Boilerplater's modules have kludgy
> internals, but on the whole they seem to do a good job of throwing errors rather
> than failing subtly.
>
> I expect to submit individual Boilerplater modules using JIRA sub-issues which
> reference this one, to allow room for adequate commentary.

