lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Sample extension for Lucy
Date Thu, 26 Apr 2012 22:40:46 GMT
On Thu, Apr 26, 2012 at 8:45 AM, Nick Wellnhofer <wellnhofer@aevum.de> wrote:
> I just published a compiled extension for Lucy on Github:
>
> https://github.com/nwellnhof/LucyX-Analysis-WhitespaceTokenizer

Awesome.

After a couple tweaks, I got it to build and pass its test suite.  :)

> It's a simple whitespace tokenizer that's not meant to be used in production
> but to serve as a sample extension for development Here are some notes on
> stuff that's still to do:
>
> Currently, we use the last component of the module name as parcel. This
> results in very long symbol names in the case of WhitespaceTokenizer.
> We should add a "parcel" build parameter to Clownfish::CFC::Perl::Build, so
> we can use something shorter like "WSToker".

The Clownfish header language currently supports such a feature in its parcel
declarations:

    parcel KinoSearch cnick Kino;

We had exactly the same issue that you're dealing with when the library had
the 10-letter name "KinoSearch" instead of the 4-letter name "Lucy". :)
See the patch below my signature for the impact of making such a change.

I'm not happy with that API for specifying parcel nicknames, though; it was
just enough to get the job done at the time, and parcels really need to
support a wide variety of properties.

My preferred mechanism now would be to have dedicated Clownfish parcel files
with ".cfp" extensions, encoded as JSON.  Something like this, in the file
"core/WhitespaceTokenizer.cfp":

    {
        "parcel": "com::github::nwellnhof::WhitespaceTokenizer",
        "version": "v0.1.0"
        "nickname": "WSToker",
        "requires": {
            "org::apache::lucy::Lucy": "v0.4.0"
        }
    }

> In WhitespaceTokenizer.cfh I had to add a __C__ block that includes
> Lucy/Analysis/Inversion.h because the generated XS needs the LUCY_INVERSION
> VTable. That's not ideal.

Agreed.  IMO what it should look like instead is either something like this...

    parcel WhitespaceTokenizer;
    use Lucy::Analysis::Tokenizer from org::apache::lucy::Lucy v0.4;

... or like this:

    parcel WhitespaceTokenizer;
    use org::apache::lucy::Lucy v0.4;

> As previously mentioned, all Lucy types used in WhitespaceTokenizer.cfh have
> to be prefixed with "lucy_".

Yes.  This is high on my TODO list, but right this moment I have my teeth in a
potential solution for the fragile member var ABI issue. :)

> There's an intricate problem with XSLoader that only manifests when running
> the tests. See the comment in WhitespaceTokenizer.pm.

Sheesh, how goofy.  I hope we find a fix for that eventually, so that
nobody else has to go down the lengthy debugging path you must have gone down.

Here are the two tweaks I had to make to build and pass tests on OS X:

  * Hack in a Clownfish include directory manually for when Clownfish::CFC and
    Lucy are not installed into $Config{installsitearch}.
  * Comment out the "extra_linker_flags" argument in Build.PL.

On OS X, you get this error with you try to link against Lucy.bundle:

ld: in /Users/marvin/Desktop/scratchlib/lib/perl5/darwin-2level/auto/Lucy/Lucy.bundle,
can't link with bundle (MH_BUNDLE) only dylibs (MH_DYLIB)
collect2: ld returned 1 exit status

Here's the explanation:

    http://stackoverflow.com/questions/2339679/what-are-the-differences-between-so-and-dylib-on-osx
    One Mach-O feature that hits many people by surprise is the distinction
    between shared libraries and dynamically loadable modules. On ELF systems
    both are the same; any piece of shared code can be used as a library and
    for dynamic loading.

This leads me to question whether extensions should be linking against the
shared objects of their dependencies on *any* platform.  Does that create a
tight binding that will e.g. break WhitespaceTokenizer when Lucy gets updated?
I'm pretty sure we want runtime relocation of upstream symbols.

> It's very illustrative to look at code that's created in autogen when
> building the extension, especially autogen/source/parcel.c.

Cool -- and parcel.c for WhitespaceTokenizer is only 133 lines long, as
opposed to over 33000 for all of Lucy!

Marvin Humphrey

--- a/core/LucyX/Analysis/WhitespaceTokenizer.c
+++ b/core/LucyX/Analysis/WhitespaceTokenizer.c
@@ -1,7 +1,7 @@
-#define WHITESPACETOKENIZER_USE_SHORT_NAMES
+#define WSTOKER_USE_SHORT_NAMES
 #define LUCY_USE_SHORT_NAMES

-#define C_WHITESPACETOKENIZER_WHITESPACETOKENIZER
+#define C_WSTOKER_WHITESPACETOKENIZER
 #define C_LUCY_TOKEN
 #include "Lucy/Util/ToolSet.h"

@@ -13,7 +13,7 @@
 #include <ctype.h>

 void
-whitespacetokenizer_init_parcel() {
+wstoker_init_parcel() {
 }

 WhitespaceTokenizer*
diff --git a/core/LucyX/Analysis/WhitespaceTokenizer.cfh
b/core/LucyX/Analysis/WhitespaceTokenizer.cfh
index 78bc2c9..cab1fce 100644
--- a/core/LucyX/Analysis/WhitespaceTokenizer.cfh
+++ b/core/LucyX/Analysis/WhitespaceTokenizer.cfh
@@ -1,4 +1,4 @@
-parcel WhitespaceTokenizer;
+parcel WhitespaceTokenizer cnick WSToker;

 __C__
 #include "Lucy/Analysis/Inversion.h"
diff --git a/perl/buildlib/Binding.pm b/perl/buildlib/Binding.pm
index e9ad98b..820be33 100644
--- a/perl/buildlib/Binding.pm
+++ b/perl/buildlib/Binding.pm
@@ -12,7 +12,7 @@ sub bind_all {
 MODULE = LucyX::Analysis::WhitespaceTokenizer    PACKAGE =
LucyX::Analysis::WhitespaceTokenizer

 BOOT:
-    whitespacetokenizer_LucyX__Analysis__WhitespaceTokenizer_bootstrap();
+    wstoker_LucyX__Analysis__WhitespaceTokenizer_bootstrap();
 END_XS_CODE

     my $pod_spec = Clownfish::CFC::Binding::Perl::Pod->new;

Mime
View raw message