incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject [lucy-dev] Other Analyzer renames
Date Wed, 09 Mar 2011 04:02:46 GMT
Greets,

In addition to the renaming of Tokenizer to RegexTokenizer, there are a couple
of other Analyzer classes I think we should consider renaming.

"Stemmer" should be renamed for the same reason as Tokenizer -- the generic
name should be reserved for the interface, as there are other stemmers out
there besides Snowball's.

    Lucy::Analysis::Stemmer => Lucy::Analysis::SnowballStemmer

Similarly, Lucy::Analysis::Stopalizer depends on materials that originate with
the Snowball project and should probably incorporate "Snowball" into its name.
However, unlike "tokenizer" and "stemmer", the word "stopalizer" isn't
standard terminology.  We don't have to keep it.

Lucene supplies "StopFilter" (which subclasses "TokenFilter") and
"StopAnalyzer" (which subclasses Analyzer).  Those suggest either
"SnowballStopFilter" or "SnowballStopAnalyzer", of which I think
"SnowballStopFilter" is better.

    Lucy::Analysis::Stopalizer => Lucy::Analysis::SnowballStopFilter

Lastly, I'm inclined towards breaking up PolyAnalyzer.  IMO, it should keep
its current behavior when you supply an array of analyzers...

    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => \@analyzers,
    );

... but I think PolyAnalyzer's language-specific pre-fab sets, which combine a
regex tokenizer, a Snowball stopalizer, and a Snowball stemmer, shouldn't be
core Lucy.  In other words, it should be possible to compile Lucy under the C
API and use PolyAnalyzer's analyzers-in-series capabilities without having to
link in a regex engine and the Snowball libraries.
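
To make the "analyzers-in-series" idea concrete, here is a rough,
self-contained sketch.  Everything under Sketch:: is a made-up stand-in, not
Lucy's actual API; the only point is that the chaining logic needs to know
nothing about which components it is chaining.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for illustration only; not Lucy's real classes.
package Sketch::Analyzer;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
# Takes an array ref of tokens and returns an array ref of tokens.
sub transform { my ($self, $tokens) = @_; return $tokens }

package Sketch::WhitespaceSplitter;
our @ISA = ('Sketch::Analyzer');
sub transform {
    my ($self, $tokens) = @_;
    # split ' ' is Perl's special-case whitespace split.
    return [ map { split ' ', $_ } @$tokens ];
}

package Sketch::Lowercaser;
our @ISA = ('Sketch::Analyzer');
sub transform {
    my ($self, $tokens) = @_;
    return [ map { lc($_) } @$tokens ];
}

package Sketch::PolyAnalyzer;
our @ISA = ('Sketch::Analyzer');
sub transform {
    my ($self, $tokens) = @_;
    # Feed the output of each analyzer into the next one, in order.
    $tokens = $_->transform($tokens) for @{ $self->{analyzers} };
    return $tokens;
}

package main;

my $poly = Sketch::PolyAnalyzer->new(
    analyzers => [ Sketch::WhitespaceSplitter->new, Sketch::Lowercaser->new ],
);
my $tokens = $poly->transform( ['Other Analyzer renames'] );
print join( '|', @$tokens ), "\n";    # prints "other|analyzer|renames"
```

Note that Sketch::PolyAnalyzer never names a tokenizer, stop filter, or
stemmer; it only walks whatever list it was handed.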

My impulse is to factor an "EasyAnalyzer" class out of PolyAnalyzer.

    my $analyzer = Lucy::Analysis::EasyAnalyzer->new(
        language => 'en',
    );
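
As a sketch of how that factoring might look: the Sketch:: names and the tiny
stoplist below are assumptions for illustration, and plain code refs stand in
for the real regex tokenizer, Snowball stop filter, and Snowball stemmer.  The
language-specific class only picks components; the generic class only chains
them.

```perl
use strict;
use warnings;

# Hypothetical sketch; names under Sketch:: are illustrative, not Lucy's API.
package Sketch::PolyAnalyzer;
sub new {
    my ($class, %args) = @_;
    return bless { analyzers => $args{analyzers} }, $class;
}
sub transform {
    my ($self, $tokens) = @_;
    # Each analyzer here is a code ref: array ref in, array ref out.
    $tokens = $_->($tokens) for @{ $self->{analyzers} };
    return $tokens;
}

package Sketch::EasyAnalyzer;
our @ISA = ('Sketch::PolyAnalyzer');

# Placeholder per-language data.  A real version would select a Snowball
# stoplist and stemmer for the language instead.
my %stoplist = ( en => { the => 1, a => 1, of => 1 } );

sub new {
    my ($class, %args) = @_;
    my $language = $args{language} // '';
    my $stop = $stoplist{$language}
        or die "No pre-fab analyzer for language '$language'\n";
    my @analyzers = (
        sub { [ map { split ' ', lc $_ } @{ $_[0] } ] },    # tokenize, case-fold
        sub { [ grep { !$stop->{$_} } @{ $_[0] } ] },       # drop stopwords
    );
    # All chaining behavior is inherited; this class only chooses the parts.
    return $class->SUPER::new( analyzers => \@analyzers );
}

package main;

my $analyzer = Sketch::EasyAnalyzer->new( language => 'en' );
my $tokens   = $analyzer->transform( ['The renaming of Tokenizer'] );
print join( '|', @$tokens ), "\n";    # prints "renaming|tokenizer"
```

With this split, only the EasyAnalyzer side would need the regex engine and
Snowball; the PolyAnalyzer side stays dependency-free.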

However, I don't consider simplifying PolyAnalyzer as important as vacating
the namespaces for Tokenizer and Stemmer prior to release 0.1.0.  There's
likely to be an Analyzer overhaul focusing on speed, opening up the API, and
modularization sometime in the reasonably near future, and there's no
guarantee that "PolyAnalyzer" will survive that overhaul in a recognizable
form, if it survives at all.

Marvin Humphrey

