Date: Mon, 14 Nov 2011 20:22:09 -0800
From: Marvin Humphrey <marvin@rectangular.com>
To: lucy-user@incubator.apache.org
Subject: Re: [lucy-user] Custom analyzers

On Tue, Nov 15, 2011 at 02:41:22AM +0100, Nick Wellnhofer wrote:
>>> Are there any other caveats? Is there any documentation on how to write
>>> your own analyzer classes?
>>
>> The subclassing API for Analyzer was redacted prior to Lucy 0.1 in
>> anticipation of refactoring; Lucy::Analysis::Inversion and
>> Lucy::Analysis::Token are not public classes.  So what you are trying to do is
>> not officially supported.
>>
>> That said, we know that we need to restore this capability.  The more people
>> who are hacking on the Lucy core analysis code, the sooner we will be able to
>> do so.
>
> Are there any additional pointers for people who'd like to hack on this?

Let's see.

One thing to bear in mind is that Analyzer performance from Perl is going to
be considerably slower than from C.  There's a lot of string copying that has
to happen in order to make token data available in Perl-space.

That may drive you to port your Analyzer module to C eventually, which would
be an effective route to gaining familiarity with the Lucy indexing chain, as
all that core code is in C.

Aside from that... the current Analyzer codebase is stable and fine -- I just
want to find ways to speed it up.  I tried to optimize it with some memory
allocation tricks a while back (allocating tokens from memory pools), but the
results were disappointing -- the changes failed to produce speedups which
justified the complexity costs.

Lucene uses a completely different Analysis model from Lucy, where Analyzers
mutate a single token rather than processing all tokens at once before handing
off to the next Analyzer in the chain.  It's less intuitive, and I'm not
actually sure whether it would be faster, but it's worth a try.  That's unlikely
to happen until somebody besides me decides they really care about Analyzer
speed, though.
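
To make the contrast concrete, here is a minimal sketch of the two models in
Python. It is illustrative only: the class and function names (Token,
WhitespaceStream, LowercaseFilter, advance, analyze_batch, analyze_stream) are
hypothetical and mirror neither Lucy's nor Lucene's actual API.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    start: int
    end: int


# Batch model: each stage consumes the complete token list and returns a
# new one before the next stage in the chain runs.
def tokenize(text: str) -> List[Token]:
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]


def lowercase(tokens: List[Token]) -> List[Token]:
    return [Token(t.text.lower(), t.start, t.end) for t in tokens]


def analyze_batch(text: str) -> List[str]:
    return [t.text for t in lowercase(tokenize(text))]


# Streaming model: a single Token object is shared by the whole chain and
# overwritten in place each time the chain advances by one token.
class WhitespaceStream:
    def __init__(self, text: str) -> None:
        self.token = Token("", 0, 0)
        self._matches = re.finditer(r"\S+", text)

    def advance(self) -> bool:
        m = next(self._matches, None)
        if m is None:
            return False
        self.token.text = m.group()
        self.token.start = m.start()
        self.token.end = m.end()
        return True


class LowercaseFilter:
    def __init__(self, source: WhitespaceStream) -> None:
        self.source = source
        self.token = source.token  # same object, not a copy

    def advance(self) -> bool:
        if not self.source.advance():
            return False
        self.token.text = self.token.text.lower()
        return True


def analyze_stream(text: str) -> List[str]:
    stream = LowercaseFilter(WhitespaceStream(text))
    out = []
    while stream.advance():
        out.append(stream.token.text)  # must copy out before advancing
    return out
```

Both functions yield the same terms for the same input. The difference is in
allocation: the batch version builds a full list of tokens per stage, while
the streaming version reuses one token for the whole pass. Whether that
translates into a real speedup in C is exactly the open question.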

Lastly, it would be nice to have a cookbook entry on subclassing Analyzer,
which could live on the Lucy wiki for now.  I'm probably in the best position
to write that up -- I might remember other pointers which would benefit you
while doing so.  However, I've been falling behind on user inquiries and
support questions of late...

> Thinking more about it, Unicode normalization would also be a nice feature
> for the Lucy analyzer.

I would think that you would run your text through Unicode normalization prior
to indexing it.  But I suppose that an Analyzer could take normalized source
text and still produce tokens which are not themselves normalized.
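
A contrived sketch of how that can happen, in Python. The filter here
(strip_soft_hyphens) is hypothetical and not part of Lucy; it stands in for
any analysis step that deletes or rearranges characters. The source text is
already in NFC, yet the token the filter emits is not, because removing the
soft hyphen brings a base letter and a combining accent together.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"
COMBINING_ACUTE = "\u0301"


def is_nfc(s: str) -> bool:
    """True if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s


def strip_soft_hyphens(token: str) -> str:
    """Hypothetical analysis step: drop soft hyphens from a token."""
    return token.replace(SOFT_HYPHEN, "")


# "e", soft hyphen, combining acute.  The soft hyphen sits between the base
# letter and the accent, so NFC cannot compose them and leaves the text as-is.
source = "e" + SOFT_HYPHEN + COMBINING_ACUTE

# After the filter runs, "e" and the accent are adjacent but uncomposed.
token = strip_soft_hyphens(source)

# Re-normalizing at the end of the chain composes them into U+00E9.
renormalized = unicodedata.normalize("NFC", token)
```

Here is_nfc(source) is true, is_nfc(token) is false, and renormalized is the
single code point U+00E9.  A normalizing step at the end of the analysis chain
would catch cases like this one.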

> Would it make sense to have all the Unicode functionality in the Lucy
> core using a third party Unicode library? Or should we rely on the
> Unicode support of the host language like we do for case folding?

That hinges on the dependability, portability, licensing terms and ease of
integration of this hypothetical third-party Unicode library.
Dependencies are cool so long as we can bundle them, they don't take a million
years to compile, they don't sabotage all the hard work we've done to make
Lucy portable, etc.  (For a longer take on dependencies, see
<http://markmail.org/message/2zsunkfleqocix67>.)

In the absence of a suitable library, we can continue to fall back on the host
language support, which will at least be reliable and portable.

Marvin Humphrey