lucy-user mailing list archives

From Nikola Tulechki <nikola.tulec...@gmail.com>
Subject Re: [lucy-user] Snowball stemmer stoplists
Date Thu, 17 Jan 2013 16:11:15 GMT
Thanks Nick,

I was secretly hoping that there was built-in functionality that
does this. Unfortunately, the solution is more complex.
I'll look into it.
Best

NT

On Thu, Jan 17, 2013 at 2:48 PM, Nick Wellnhofer <wellnhofer@aevum.de> wrote:
> On 17/01/2013 10:21, Nikola Tulechki wrote:
>>
>> Hello
>>
>> I am using Lucy on technical documentation and I have a bunch of
>> acronyms that must not be stemmed.
>> Is there a way to add a stoplist to the stemmer so it skips some terms?
>
>
> It can be done, but it's not trivial and probably not very performant.
> First, you have to write your own Analyzer class in Perl. See the following
> threads for some guidance:
>
> http://mail-archives.apache.org/mod_mbox/lucy-user/201111.mbox/%3C4EC161D0.1060103@aevum.de%3E
> http://mail-archives.apache.org/mod_mbox/lucy-user/201207.mbox/%3C4FF1A2B7.7060207@easyconnect.no%3E
>
> We really need a cookbook entry describing how to write custom analyzers.
> But to get started, here is some minimal skeleton code that I have used in
> the past:
>
>     package My::Custom::Analyzer;
>     use strict;
>     use warnings;
>
>     use base qw(Lucy::Analysis::Analyzer);
>
>     sub new {
>         my ($class, %args) = @_;
>         my $self = $class->SUPER::new(%args);
>
>         # Setup your analyzer here
>
>         return $self;
>     }
>
>     sub transform {
>         my ($self, $inversion) = @_;
>
>         while (my $token = $inversion->next) {
>             my $text = $token->get_text;
>
>             # Transform $text here
>
>             $token->set_text($text);
>         }
>
>         $inversion->reset;
>         return $inversion;
>     }
>
>     sub equals {
>         return 1;
>     }
>
>     1;
>
> For a proper implementation, you should also provide "dump" and "load"
> methods and a real "equals" method, but they're not really necessary for a
> one-off job. Just remember to always reindex after changing the parameters
> of your custom analyzer: without "dump", "load" and "equals", you won't get
> an error message in this case.
>
> Your custom analyzer should then stem the words that are not in your
> stoplist one by one using the stemmer's (undocumented) "split" method. So
> the "transform text" part of your analyzer would look like:
>
>     if (!$stoplist->{$text}) {
>         my $tokens = $stemmer->split($text);
>         $text = $tokens->[0];
>     }
>
> Also note that you have to store member variables like "stoplist" and
> "stemmer" of your analyzer class using the "inside-out" approach (one global
> hash per variable). You'll find some example code showing how to do that in
> the threads I mentioned above.
>
> Nick
>
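
For reference, the "inside-out" storage and the stoplist check described
above can be sketched in plain Perl without Lucy installed. This is only an
illustration under stated assumptions: the package name My::InsideOut::Demo,
the transform_text method and the toy stemmer closure are all hypothetical
stand-ins, and keying the hashes with Scalar::Util's refaddr is one common
way to do inside-out objects, not necessarily what the linked threads use.
In a real analyzer, the object would come from the Lucy base class
constructor and the stemmer would be a Lucy stemmer object.

```perl
package My::InsideOut::Demo;    # hypothetical name, for illustration only
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# One file-scoped hash per member variable, keyed by object address.
my %stoplist;
my %stemmer;

sub new {
    my ($class, %args) = @_;

    # Stand-in for the opaque object a base class constructor would return.
    my $scalar;
    my $self = bless \$scalar, $class;

    my $id = refaddr($self);
    $stoplist{$id} = { map { $_ => 1 } @{ $args{stoplist} || [] } };
    $stemmer{$id}  = $args{stemmer};
    return $self;
}

# Return $text unchanged if it is in the stoplist, otherwise stem it.
sub transform_text {
    my ($self, $text) = @_;
    my $id = refaddr($self);
    return $text if $stoplist{$id}{$text};
    return $stemmer{$id}->($text);
}

# Remove the per-object entries so the hashes do not grow forever.
sub DESTROY {
    my $self = shift;
    my $id   = refaddr($self);
    delete $stoplist{$id};
    delete $stemmer{$id};
}

package main;

# Hypothetical toy stemmer that strips a trailing "s". Not Snowball.
my $toy_stemmer = sub { my $t = shift; $t =~ s/s\z//; return $t };

my $analyzer = My::InsideOut::Demo->new(
    stoplist => [qw(CMOS BIOS)],
    stemmer  => $toy_stemmer,
);

# Prints "BIOS chip CMOS": the two acronyms are skipped, "chips" is stemmed.
print join(' ', map { $analyzer->transform_text($_) } qw(BIOS chips CMOS)), "\n";
```

Note that the stoplist lookup here is case-sensitive, so it has to run on the
token text exactly as it reaches the analyzer (before or after case folding,
depending on where the analyzer sits in the chain).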
