lucy-user mailing list archives

From arjan <ar...@unitedknowledge.nl>
Subject [lucy-user] stemming, Lucy and Stem::Lingua::Snowball
Date Sat, 09 Jul 2011 16:07:18 GMT
Dear all,

In English, possession can be indicated by apostrophe-s, as in "this 
man's computer". In Dutch it is almost the same, except that in most 
cases there is no apostrophe. We only use an apostrophe when the word 
ends in an s or in a/o/e/i/u. For example:

Jans hoed (hat)
Jos' tas (bag)
Monica's jas (coat)
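
To make the rule concrete, here is a minimal sketch in Perl. The helper 
name dutch_possessive is made up for illustration, and it encodes only 
the simplified rule as stated above, not the full Dutch spelling rules.

```perl
use strict;
use warnings;

# Hypothetical helper: form the Dutch possessive following only the
# simplified rule described in this message.
sub dutch_possessive {
    my ($name) = @_;
    return $name . "'"  if $name =~ /s\z/i;        # ends in s:     Jos    -> Jos'
    return $name . "'s" if $name =~ /[aeiou]\z/i;  # ends in vowel: Monica -> Monica's
    return $name . "s";                            # default:       Jan    -> Jans
}

print "$_ -> ", dutch_possessive($_), "\n" for qw( Jan Jos Monica );
```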

The Lingua::Stem::Snowball module does not know this. The small script 
below this email demonstrates that.

The default case is stemmed correctly: Jans -> Jan. For the exceptions, 
Jos' and Monica's, the stemmer leaves the apostrophe at the end. And 
Jan's, a spelling of Jans that is incorrect in Dutch, is also stemmed 
wrongly.

In Lucy this leads to having Jos' and Monica' as words in the lexicon. 
Messages containing "Monica's" will not be found when searching for 
"Monica". This is demonstrated with the word Halsema's in the second 
copy-paste script below.
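
What I would expect, roughly, is that the possessive marker is removed 
before the stem is stored. A minimal sketch of that expectation in plain 
Perl follows. The helper name strip_possessive is made up here; it is 
not something Lingua::Stem::Snowball or Lucy provides, and it only 
handles the ASCII apostrophe.

```perl
use strict;
use warnings;

# Hypothetical helper: drop a trailing apostrophe or apostrophe-s, so
# that the possessive and the plain name end up as the same term.
sub strip_possessive {
    my ($token) = @_;
    $token =~ s/'s?\z//;    # Monica's -> Monica, Jos' -> Jos, Jan's -> Jan
    return $token;
}

print strip_possessive($_), "\n" for qw( Jos' Monica's Jan's Halsema's );
```

A word like Jans has no apostrophe, so it passes through unchanged and 
is left to the stemmer.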

Is this indeed a bug? Is there a way to work around this?

Kind regards,
Arjan Widlak

United Knowledge
http://www.unitedknowledge.nl

---Lingua::Stem::Snowball--------------------------------------------------------------------------------------
use strict;
use warnings;
use 5.010;

use Encode;
use Lingua::Stem::Snowball;

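# the regular form, the two apostrophe exceptions, and the erroneous spelling
my @words = qw( Jans Jos' Monica's Jan's );

my $stemmer = Lingua::Stem::Snowball->new( lang => 'nl' );
$stemmer->stem_in_place( \@words );    # replaces each word with its stem

foreach my $word ( @words ) {
     say encode( 'utf8', $word );
}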
---Lingua::Stem::Snowball--------------------------------------------------------------------------------------

---Lucy---------------------------------------------------------------------------------------------------------------
use strict;
use warnings;
use 5.010;
use Encode;

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;
use Lucy::Analysis::RegexTokenizer;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Analysis::CaseFolder;
use Lucy::Analysis::SnowballStemmer;
use Lucy::Index::IndexReader;
use Lucy::Index::LexiconReader;
use utf8; #data in script itself

# create an index
my $document1 = {
     searchstring    => qq|In een column schrijft hij een reactie op |
                      . qq|Femke Halsema's voorstel om te komen tot meer samenwerking op links.|,
};

my $message_storage = "/tmp";
my $schema          = Lucy::Plan::Schema->new;
my $case_folder     = Lucy::Analysis::CaseFolder->new;
my $tokenizer       = Lucy::Analysis::RegexTokenizer->new;

my $stemmer = Lucy::Analysis::SnowballStemmer->new(
     language    => 'nl',
);
my $polyanalyzer    = Lucy::Analysis::PolyAnalyzer->new(
     # the explicit analyzer chain already carries the language via $stemmer
     analyzers   => [ $case_folder, $tokenizer, $stemmer ],
);

# Field Types
my $type_text    = Lucy::Plan::FullTextType->new(
     analyzer        => $polyanalyzer,
     indexed         => 1,
     stored          => 1,
     sortable        => 0
);

$schema->spec_field( name => "searchstring", type => $type_text );
my $indexer = Lucy::Index::Indexer->new(
     schema      => $schema,
     index       => $message_storage,
     create      => 1,
     truncate    => 1,
);

$indexer->add_doc( $document1 );
$indexer->commit;

# See what we find
my $query_parser = Lucy::Search::QueryParser->new(
     schema  => $schema,
     fields  => [ 'searchstring' ],
);

my $query = $query_parser->parse( 'Halsema' );

my $searcher = Lucy::Search::IndexSearcher->new(
     index => $message_storage,
);

my $hits = $searcher->hits(
     query       => $query,
     offset      => 0,
     num_wanted  => 10000,
);

say encode( 'utf8', "\n\tHits from the index:");
while ( my $hit = $hits->next ) {
     say encode( 'utf8', "found hit on: " . $hit->{ searchstring } );
}

# See what's in the lexicon
my $polyreader = Lucy::Index::IndexReader->open(
     index => $message_storage,
);
my $seg_readers = $polyreader->seg_readers;

say encode('utf8', "\n\tIndividual words in the lexicon:");
foreach my $seg_reader ( @$seg_readers ) {
     my $lex_reader = $seg_reader->obtain( "Lucy::Index::LexiconReader" );
     my $lexicon    = $lex_reader->lexicon( field => 'searchstring' );

     while ( $lexicon->next ) {
         say encode( 'utf8', $lexicon->get_term );
     }
}
---Lucy---------------------------------------------------------------------------------------------------------------

-- 
Recent: http://www.lomcongres.nl/
Conference and newsletter portal with a relationship-management system for the Landelijk Overleg Milieuhandhaving

Setting Standards, a Delft University of Technology and United Knowledge simulation exercise
on strategy and cooperation in standardization, http://www.setting-standards.com

United Knowledge, internet for the public sector
Keizersgracht 74
1015 CT Amsterdam
T +31 (0)20 52 18 300
F +31 (0)20 52 18 301
bureau@unitedknowledge.nl
http://www.unitedknowledge.nl

M +31 (0)6 2427 1444
E arjan@unitedknowledge.nl

Visit our site at:
http://www.unitedknowledge.nl

Or have a look at one of our projects:
http://www.handhavingsportaal.nl/
http://www.setting-standards.com/
http://www.lomcongres.nl/
http://www.clubvanmaarssen.org/



