lucy-user mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-user] Snowball stemmer stoplists
Date Thu, 17 Jan 2013 13:48:22 GMT
On 17/01/2013 10:21, Nikola Tulechki wrote:
> Hello
>
> I am using lucy on a technical documentation and I have a bunch of
> acronyms that must not be stemmed.
> Is there a way to add a stoplist to the stemmer so it skips some terms?

It can be done, but it's not trivial and probably not very performant. 
First, you have to write your own Analyzer class in Perl. See the 
following threads for some guidance:

http://mail-archives.apache.org/mod_mbox/lucy-user/201111.mbox/%3C4EC161D0.1060103@aevum.de%3E
http://mail-archives.apache.org/mod_mbox/lucy-user/201207.mbox/%3C4FF1A2B7.7060207@easyconnect.no%3E

We really need a cookbook entry describing how to write custom 
analyzers. But to get started, here is some minimal skeleton code that I 
have used in the past:

     package My::Custom::Analyzer;
     use strict;

     use base qw(Lucy::Analysis::Analyzer);

     sub new {
         my ($class, %args) = @_;
         my $self = $class->SUPER::new(%args);

         # Setup your analyzer here

         return $self;
     }

     sub transform {
         my ($self, $inversion) = @_;

         while (my $token = $inversion->next) {
             my $text = $token->get_text;

             # Transform $text here

             $token->set_text($text);
         }

         $inversion->reset;
         return $inversion;
     }

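     # Placeholder: reports this analyzer as equal to any other one,
     # so a changed configuration goes undetected.
     sub equals {
         return 1;
     }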

     1;

For a proper implementation, you should also provide "dump" and "load" 
methods and a real "equals" method, but they're not really necessary for 
a one-off job. Just remember to reindex whenever you change the 
parameters of your custom analyzer. Without "dump", "load" and a real 
"equals", you won't get an error message if you forget to do so.

Your custom analyzer should then stem the words that are not in your 
stoplist one by one using the stemmer's (undocumented) "split" method. 
With $stoplist being a hash reference keyed by term, the "Transform 
$text here" part of the skeleton would look like:

     if (!$stoplist->{$text}) {
         my $tokens = $stemmer->split($text);
         $text = $tokens->[0];
     }
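
To make the per-token logic above concrete, here is a minimal sketch that
runs without Lucy. Demo::Stemmer is a hypothetical stand-in that only
mimics the shape of "split" (text in, array reference of token texts out);
it strips a trailing "s" and is not a Snowball stemmer. The stoplist
entries are made up, and they are lowercase on the assumption that tokens
have already been case-folded before they reach this step.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Lucy stemmer object. It only mimics the
# calling convention of split(): one text in, array ref of texts out.
package Demo::Stemmer;

sub new { return bless {}, shift }

sub split {
    my ($self, $text) = @_;
    (my $stem = $text) =~ s/s\z//;    # toy rule, not Snowball
    return [$stem];
}

package main;

# Build the stoplist once as a hash so each lookup is constant time.
# Lowercase keys assume the tokens were case-folded earlier.
my %stoplist = map { $_ => 1 } qw(dns https sas);
my $stemmer  = Demo::Stemmer->new;

sub stem_unless_stopped {
    my ($text) = @_;

    # Terms on the stoplist pass through unchanged.
    return $text if $stoplist{$text};

    my $tokens = $stemmer->split($text);

    # Keep the original text if the stemmer returns nothing.
    return @$tokens ? $tokens->[0] : $text;
}

print join(' ', map { stem_unless_stopped($_) } qw(servers dns https)), "\n";
# prints "server dns https"
```

Without the stoplist, the toy stemmer would turn "dns" into "dn" and
"https" into "http", which is the kind of damage to acronyms that the
lookup prevents.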

Also note that Lucy objects are not hash references, so you have to 
store member variables like "stoplist" and "stemmer" of your analyzer 
class using the "inside-out" approach (one global hash per variable, 
keyed by object). You'll find some example code showing how to do that 
in the threads I mentioned above.
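
The inside-out pattern can be sketched roughly as follows. Demo::Base is
a hypothetical stand-in for Lucy::Analysis::Analyzer so that the example
runs without Lucy; the class name, the "stoplist" and "stemmer"
constructor arguments and the "is_stopped" helper are all made up for
illustration. Keying the hashes by refaddr is one common choice, and the
code in the threads above may key them differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Lucy::Analysis::Analyzer. Instances are
# blessed scalar refs, so there are no hash slots to store data in.
package Demo::Base;

sub new {
    my ($class) = @_;
    my $opaque = 0;
    return bless \$opaque, $class;
}

sub DESTROY { }

package Demo::StoplistAnalyzer;
use Scalar::Util qw(refaddr);
our @ISA = ('Demo::Base');

# One file-scoped hash per member variable, keyed by object address.
my %stoplist;
my %stemmer;

sub new {
    my ($class, %args) = @_;

    # Remove our own arguments before calling the parent constructor.
    my $list = delete $args{stoplist} || {};
    my $stem = delete $args{stemmer};

    my $self = $class->SUPER::new(%args);
    $stoplist{ refaddr($self) } = $list;
    $stemmer{ refaddr($self) }  = $stem;
    return $self;
}

sub is_stopped {
    my ($self, $text) = @_;
    return exists $stoplist{ refaddr($self) }{$text} ? 1 : 0;
}

# Delete the entries when the object goes away, otherwise they leak.
sub DESTROY {
    my $self = shift;
    delete $stoplist{ refaddr($self) };
    delete $stemmer{ refaddr($self) };
    $self->SUPER::DESTROY;
}

package main;

my $analyzer = Demo::StoplistAnalyzer->new(
    stoplist => { HTTP => 1, SNMP => 1 },
);
print $analyzer->is_stopped('HTTP') ? "skip\n" : "stem\n";
# prints "skip"
```

The DESTROY method matters here: the hashes live as long as the program
does, so every entry has to be removed by hand when its object is
destroyed.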

Nick

