lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Migrating SnowballAnalyzer to 4.1
Date Thu, 28 Feb 2013 21:15:02 GMT
Hi Peng,

The short answer: EnglishAnalyzer will behave differently in terms of stemming than SnowballAnalyzer("English",
StopAnalyzer.ENGLISH_STOP_WORDS).

PorterStemmer, which is used by the EnglishAnalyzer in analyzers-common, is an older version
of the English Snowball stemmer (now called EnglishStemmer in Lucene).  See the Status section
in Martin Porter's page[1] for a description of the differences - PorterStemmer implements
his original algorithm, and EnglishStemmer implements the Porter2 algorithm.

EnglishAnalyzer has used PorterStemmer instead of the English Snowball stemmer since it was
created in 2010 as part of LUCENE-2055[2].  I think this is an oversight: EnglishAnalyzer
should incorporate the best English stemmer we've got, and Martin Porter says the Porter2
stemmer is better[1].  Robert Muir (who wrote EnglishAnalyzer), if you're reading, what do
you think?  

If you want a drop-in replacement, for now it looks like you'll have to put it together yourself.
It's not hard - you can copy the source for EnglishAnalyzer[3] and substitute 'SnowballFilter(result,
"English")' where you see 'PorterStemFilter(result)', and import org.apache.lucene.analysis.snowball.SnowballFilter.
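
As a rough illustration of that substitution, here is a minimal sketch of such an analyzer against the 4.1 API. It is not the EnglishAnalyzer source itself: the class name SnowballEnglishAnalyzer and the stems() helper are made up for this example, the stop set is hard-wired to StopAnalyzer.ENGLISH_STOP_WORDS_SET on the assumption that you want the same stop words as before, and the stem-exclusion handling that EnglishAnalyzer has is left out for brevity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical name; a cut-down analyzer modeled on the EnglishAnalyzer chain.
public final class SnowballEnglishAnalyzer extends Analyzer {
  private final Version matchVersion;

  public SnowballEnglishAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    // Assumption: same English stop words as the old SnowballAnalyzer setup.
    result = new StopFilter(matchVersion, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    // The one substantive change: Snowball (Porter2) instead of PorterStemFilter.
    result = new SnowballFilter(result, "English");
    return new TokenStreamComponents(source, result);
  }

  // Hypothetical helper for trying the analyzer out: returns the emitted terms.
  static List<String> stems(String text) throws IOException {
    Analyzer analyzer = new SnowballEnglishAnalyzer(Version.LUCENE_41);
    TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    List<String> out = new ArrayList<String>();
    try {
      ts.reset();
      while (ts.incrementToken()) {
        out.add(term.toString());
      }
      ts.end();
    } finally {
      ts.close();
    }
    return out;
  }
}
```

You would then pass new SnowballEnglishAnalyzer(Version.LUCENE_41) wherever you previously passed the SnowballAnalyzer. Calling stems("running dogs") is a quick way to check that the stemmer is wired in; it should give back "run" and "dog".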

Steve

[1] Martin Porter's Porter Stemming Algorithm page: <http://tartarus.org/martin/PorterStemmer/index.html>
[2] LUCENE-2055: <https://issues.apache.org/jira/browse/LUCENE-2055>
[3] EnglishAnalyzer source code: <http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_1_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java>

As far as I can tell,  
On Feb 28, 2013, at 2:52 PM, Peng Gao <pgao@esri.com> wrote:

> Hi Steve,
> Thanks for the help. One more question:
> Is EnglishAnalyzer a drop-in replacement for SnowballAnalyzer("English", ...), in terms
> of stemming?
> 
> 
> Thanks again
> Peng
> 
> PS
> Sorry for the Thread Hijacking. Will behave the next time.
> 
>> -----Original Message-----
>> From: Steve Rowe [mailto:sarowe@gmail.com]
>> Sent: Thursday, February 28, 2013 10:47 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Migrating SnowballAnalyzer to 4.1
>> 
>> Hi Peng,
>> 
>> Take a look at the release docs:
>> <http://lucene.apache.org/core/4_1_0/index.html>
>> 
>> In particular, in the API Javadocs section, the analyzers-common
>> documentation has a large list of per-language analyzers.  EnglishAnalyzer
>> is under the org.apache.lucene.analysis.en package:
>> <http://lucene.apache.org/core/4_1_0/analyzers-common/org/apache/lucene/analysis/en/package-summary.html>
>> 
>> Steve
>> 
>> On Feb 28, 2013, at 1:28 PM, Peng Gao <pgao@esri.com> wrote:
>> 
>>> Hi,
>>> 
>>> I have a Lucene 2.9.x app that uses
>>> org.apache.lucene.analysis.snowball.SnowballAnalyzer for index
>>> generation,
>>> 
>>> analyzer = new SnowballAnalyzer("English",
>>> StopAnalyzer.ENGLISH_STOP_WORDS);
>>> 
>>> and I want to upgrade it to 4.1.
>>> 
>>> SnowballAnalyzer is deprecated in 4.1. The doc simply states
>>> 
>>> "Deprecated. (3.1) Use the language-specific analyzer in
>>> modules/analysis instead. This analyzer will be removed in Lucene 5.0."
>>> 
>>> I can't figure out how to rewrite it using 4.1 API. Could you help?
>>> 
>>> Thanks,
>>> Peng
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 