incubator-lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Custom analyzers
Date Mon, 14 Nov 2011 21:22:15 GMT
On Mon, Nov 14, 2011 at 07:45:36PM +0100, Nick Wellnhofer wrote:
>
> I'm trying to write my own analyzer class that strips accents and does  
> some other transformations. I had a look at Father Chrysostomos'  
> KSx::Analysis::StripAccents and tried to get something similar to run  
> with Lucy 0.2.2. With the following two changes I could make it work:
>
> - The 'transform' method can't reuse the inversion argument but must  
> return a new inversion.

Lucy::Analysis::SnowballStemmer#transform reuses its Inversion; it should work
for you as well.  Perhaps you need to invoke Inversion#reset to reset the
iterator?
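
To illustrate the reset idea, here is a toy model in Python. It is only a
sketch and not Lucy's actual API: ToyInversion, its next/reset methods, and
strip_accents are hypothetical names made up for this example. It assumes the
token collection keeps an internal cursor that is exhausted after one pass.

```python
import unicodedata


class ToyInversion:
    """Hypothetical stand-in for a token collection with an internal cursor."""

    def __init__(self, texts):
        self.tokens = list(texts)
        self.cursor = 0

    def next(self):
        # Return the index of the next token, or None once exhausted.
        if self.cursor >= len(self.tokens):
            return None
        index = self.cursor
        self.cursor += 1
        return index

    def reset(self):
        # Rewind the cursor so the next consumer starts at the first token.
        self.cursor = 0


def strip_accents(text):
    # Decompose characters (NFD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transform(inversion):
    # Rewrite each token in place and hand back the same object.
    while (index := inversion.next()) is not None:
        inversion.tokens[index] = strip_accents(inversion.tokens[index])
    inversion.reset()  # without this, the next consumer would see no tokens
    return inversion
```

In this model, forgetting the reset() call leaves the cursor at the end, so
whatever iterates next sees an apparently empty collection. That symptom could
look as though reusing the object were not allowed.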

> Are there any other caveats? Is there any documentation on how to write  
> your own analyzer classes?

The subclassing API for Analyzer was redacted prior to Lucy 0.1 in
anticipation of refactoring; Lucy::Analysis::Inversion and
Lucy::Analysis::Token are not public classes.  So what you are trying to do is
not officially supported.

That said, we know that we need to restore this capability.  The more people
who are hacking on the Lucy core analysis code, the sooner we will be able to
do so.

> If anyone is interested in a LucyX::Analysis::StripAccents module, I  
> could put something up on CPAN.

If we were to handle this as a contribution to Lucy itself, so that
LucyX::Analysis::StripAccents would be distributed alongside other LucyX
modules such as the LucyX::Remote classes, that would allow us to change the
internal implementation for analysis without causing downstream disruption of
an independent CPAN distro for LucyX::Analysis::StripAccents.

If we go down that path, there are some licensing issues that would need to
be resolved.  We'd need Father Chrysostomos on board (which I hope would be
doable), but then there's also the issue of the Text::Unaccent dependency.

Let us know if you'd like to explore that option further.

Marvin Humphrey

