lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: [lucy-dev] Sample extension for Lucy
Date Thu, 26 Apr 2012 22:40:46 GMT
On Thu, Apr 26, 2012 at 8:45 AM, Nick Wellnhofer <> wrote:
> I just published a compiled extension for Lucy on Github:


After a couple tweaks, I got it to build and pass its test suite.  :)

> It's a simple whitespace tokenizer that's not meant to be used in production
> but to serve as a sample extension for development Here are some notes on
> stuff that's still to do:
> Currently, we use the last component of the module name as parcel. This
> results in very long symbol names in the case of WhitespaceTokenizer.
> We should add a "parcel" build parameter to Clownfish::CFC::Perl::Build, so
> we can use something shorter like "WSToker".

The Clownfish header language currently supports such a feature in its parcel

    parcel KinoSearch cnick Kino;

We had exactly the same issue that you're dealing with when the library had
the 10-letter name "KinoSearch" instead of the 4-letter name "Lucy". :)
See the patch below my signature for the impact of making such a change.

I'm not happy with that API for specifying parcel nicknames, though; it was
just enough to get the job done at the time, and parcels really need to
support a wide variety of properties.

My preferred mechanism now would be to have dedicated Clownfish parcel files
with ".cfp" extensions, encoded as JSON.  Something like this, in the file

        "parcel": "com::github::nwellnhof::WhitespaceTokenizer",
        "version": "v0.1.0"
        "nickname": "WSToker",
        "requires": {
            "org::apache::lucy::Lucy": "v0.4.0"

> In WhitespaceTokenizer.cfh I had to add a __C__ block that includes
> Lucy/Analysis/Inversion.h because the generated XS needs the LUCY_INVERSION
> VTable. That's not ideal.

Agreed.  IMO what it should look like instead is either something like this...

    parcel WhitespaceTokenizer;
    use Lucy::Analysis::Tokenizer from org::apache::lucy::Lucy v0.4;

... or like this:

    parcel WhitespaceTokenizer;
    use org::apache::lucy::Lucy v0.4;

> As previously mentioned, all Lucy types used in WhitespaceTokenizer.cfh have
> to be prefixed with "lucy_".

Yes.  This is high on my TODO list, but right this moment I have my teeth in a
potential solution for the fragile member var ABI issue. :)

> There's an intricate problem with XSLoader that only manifests when running
> the tests. See the comment in

Sheesh, how goofy.  I hope we find a fix for that eventually, so that
nobody else has to go down the lengthy debugging path you must have gone down.

Here are the two tweaks I had to make to build and pass tests on OS X:

  * Hack in a Clownfish include directory manually for when Clownfish::CFC and
    Lucy are not installed into $Config{installsitearch}.
  * Comment out the "extra_linker_flags" argument in Build.PL.

On OS X, you get this error with you try to link against Lucy.bundle:

ld: in /Users/marvin/Desktop/scratchlib/lib/perl5/darwin-2level/auto/Lucy/Lucy.bundle,
can't link with bundle (MH_BUNDLE) only dylibs (MH_DYLIB)
collect2: ld returned 1 exit status

Here's the explanation:
    One Mach-O feature that hits many people by surprise is the distinction
    between shared libraries and dynamically loadable modules. On ELF systems
    both are the same; any piece of shared code can be used as a library and
    for dynamic loading.

This leads me to question whether extensions should be linking against the
shared objects of their dependencies on *any* platform.  Does that create a
tight binding that will e.g. break WhitespaceTokenizer when Lucy gets updated?
I'm pretty sure we want runtime relocation of upstream symbols.

> It's very illustrative to look at code that's created in autogen when
> building the extension, especially autogen/source/parcel.c.

Cool -- and parcel.c for WhitespaceTokenizer is only 133 lines long, as
opposed to over 33000 for all of Lucy!

Marvin Humphrey

--- a/core/LucyX/Analysis/WhitespaceTokenizer.c
+++ b/core/LucyX/Analysis/WhitespaceTokenizer.c
@@ -1,7 +1,7 @@

 #define C_LUCY_TOKEN
 #include "Lucy/Util/ToolSet.h"

@@ -13,7 +13,7 @@
 #include <ctype.h>

-whitespacetokenizer_init_parcel() {
+wstoker_init_parcel() {

diff --git a/core/LucyX/Analysis/WhitespaceTokenizer.cfh
index 78bc2c9..cab1fce 100644
--- a/core/LucyX/Analysis/WhitespaceTokenizer.cfh
+++ b/core/LucyX/Analysis/WhitespaceTokenizer.cfh
@@ -1,4 +1,4 @@
-parcel WhitespaceTokenizer;
+parcel WhitespaceTokenizer cnick WSToker;

 #include "Lucy/Analysis/Inversion.h"
diff --git a/perl/buildlib/ b/perl/buildlib/
index e9ad98b..820be33 100644
--- a/perl/buildlib/
+++ b/perl/buildlib/
@@ -12,7 +12,7 @@ sub bind_all {
 MODULE = LucyX::Analysis::WhitespaceTokenizer    PACKAGE =

-    whitespacetokenizer_LucyX__Analysis__WhitespaceTokenizer_bootstrap();
+    wstoker_LucyX__Analysis__WhitespaceTokenizer_bootstrap();

     my $pod_spec = Clownfish::CFC::Binding::Perl::Pod->new;

View raw message