incubator-lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] stemming, Lucy and Stem::Lingua::Snowball
Date Sun, 10 Jul 2011 03:46:22 GMT
Hello, Arjan,

Thanks for the thorough example and explanation.  Good test cases are
very helpful!

On Sat, Jul 09, 2011 at 06:07:18PM +0200, arjan wrote:
> In English possession can be indicated by apostrophe s. Like: "this  
> man's computer". In Dutch this is almost the same, only in most cases  
> without the apostrophe. We only use an apostrophe when the word ends on  
> an s or on a/o/e/i/u. So for example:
>
> Jans hoed (hat)
> Jos' tas (bag)
> Monica's jas (coat)
>
> The Stem::Lingua::Snowball module does not know this. The small script  
> below this email demonstrates that.
>
> The default is stemmed correctly Jans -> Jan. On the exceptions - Jos'  
> and Monica's - the stemmer leaves the apostrophe at the end. And the -  
> in Dutch erroneous - spelling of Jans as Jan's is also stemmed wrongly.
>
> In Lucy this leads to having Jos' and Monica' as words in the lexicon.  
> Messages with "Monica's" will not be found when searching on "Monica".  
> This is demonstrated with the word Halsema's in the second copy-paste  
> script below.
>
> Is this indeed a bug?

If I understand your explanation well enough, then I think we may want to
treat it as a Lucy bug.

It seems that the behavior of the Dutch Snowball stemmer is known and
intentional.  From the Snowball website:

    http://snowball.tartarus.org/texts/introduction.html

    The Dutch stemmer presented here assumes hyphen and apostrophe have
    already been removed from the word to be stemmed. 

That means we have a bug in either Lucy::Analysis::SnowballStemmer or
Lucy::Analysis::PolyAnalyzer, depending on how independent we consider
SnowballStemmer to be.

If we believe that SnowballStemmer should compensate for the idiosyncrasies of
the Snowball library and assume responsibility for stripping apostrophes, then
SnowballStemmer has a bug.

If we believe that SnowballStemmer should be the thinnest possible wrapper
around the Snowball libraries and that it should be PolyAnalyzer's
responsibility to feed it materials with apostrophes already stripped, then
PolyAnalyzer has a bug.

I suspect that we want SnowballStemmer to assume responsibility, since that
will make it easier to use SnowballStemmer as a component.  I don't think it
would be wise for us to require that the user know about this quirk and
manually intervene to compensate for it when assembling a custom PolyAnalyzer.

Still, there's also the possibility of using a different default Tokenizer
pattern within the Dutch PolyAnalyzer.  This is the existing pattern, which is
optimized for English:

    # Matches "it's", "O'Henry's", etc...
    "\\w+(?:[\\x{2019}']\\w+)*"

Is that also well-optimized for Dutch?
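
To make it easier to judge, here is a small core-Perl sketch that applies the
same pattern to the phrases from your examples.  It runs the regex directly
with a global match rather than going through Lucy's tokenizer, so treat it as
an approximation; the helper name `tokenize` is made up for illustration.

```perl
use strict;
use warnings;

# Same pattern as the default tokenizer, written as a Perl regex.
my $token_re = qr/\w+(?:[\x{2019}']\w+)*/;

# Hypothetical helper: return every match of the pattern in a string.
sub tokenize {
    my ($text) = @_;
    return $text =~ /($token_re)/g;
}

# Sample phrases taken from the examples quoted above.
for my $phrase ( "Jans hoed", "Jos' tas", "Monica's jas" ) {
    print "$phrase => ", join( ' | ', tokenize($phrase) ), "\n";
}
```

In this standalone test, "Monica's" stays a single token with its apostrophe
intact, while the trailing apostrophe of "Jos'" falls outside the match because
the pattern only accepts an apostrophe that is followed by more word
characters.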

> Is there a way to work around this?

I believe the following PolyAnalyzer will get the job done for you:

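  my $case_folder = Lucy::Analysis::CaseFolder->new;
  my $tokenizer   = Lucy::Analysis::RegexTokenizer->new;
  my $stemmer     = Lucy::Analysis::SnowballStemmer->new( language => 'nl' );
  # A second RegexTokenizer, used here only to drop trailing apostrophes:
  # each token is cut back to the longest match ending in a non-apostrophe.
  my $apostrophe_stripper
    = Lucy::Analysis::RegexTokenizer->new( pattern => ".*[^'\\x{2019}]" );
  # The stripper runs last, so it cleans up whatever the stemmer leaves behind.
  my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [ $case_folder, $tokenizer, $stemmer, $apostrophe_stripper ],
  );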
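
If you want to check the stripping pattern on its own before reindexing, the
following core-Perl sketch applies it to a few tokens.  It assumes the stemmer
output looks like the lexicon entries you reported (case-folded, with a
trailing apostrophe), and it uses a single plain regex match in place of
RegexTokenizer; the helper name `strip_trailing_apostrophe` is hypothetical.

```perl
use strict;
use warnings;

# Same pattern as $apostrophe_stripper, written as a Perl regex.
my $strip_re = qr/.*[^'\x{2019}]/;

# Hypothetical helper: return the first match, or undef if nothing matches.
sub strip_trailing_apostrophe {
    my ($token) = @_;
    return $token =~ /($strip_re)/ ? $1 : undef;
}

# Assumed stemmer output, based on the lexicon entries reported above.
for my $token ( "monica'", "jos'", "jan" ) {
    my $stripped = strip_trailing_apostrophe($token);
    printf "%-10s => %s\n", $token,
        defined $stripped ? $stripped : '(no token)';
}
```

The greedy `.*` backs off until the last character of the match is something
other than an apostrophe, so trailing apostrophes are dropped while apostrophes
inside a token are kept.  A token consisting only of apostrophes produces no
match at all.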

Best,

Marvin Humphrey

