lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by RobertMuir
Date Tue, 18 May 2010 16:34:51 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by RobertMuir.
The comment on this change is: move this stuff to language analysis.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=79&rev2=80

--------------------------------------------------

    </analyzer>
  </fieldtype>
  }}}
- 
- ==== solr.PorterStemFilterFactory ====
- 
- Creates `org.apache.lucene.analysis.PorterStemFilter`.
- 
- Standard Lucene implementation of the [[http://tartarus.org/~martin/PorterStemmer/|Porter
Stemming Algorithm]], a normalization process that removes common endings from words.
- 
-   Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
- 
- Note: This filter deviates slightly from the published algorithm, so its output differs very slightly from that of the "Porter" algorithm available in `solr.SnowballPorterFilterFactory`.
- For more details, see the section "Points of difference from the published algorithm" [[http://tartarus.org/~martin/PorterStemmer/|here]].
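- 
- As a minimal sketch, this filter could be configured in the same way as the other stemmers on this page. The field type name `text_porter` is hypothetical, and placing `solr.LowerCaseFilterFactory` before the stemmer is an assumption about typical usage rather than a requirement stated here.
- 
- {{{
- <!-- hypothetical field type name; lowercasing before stemming is an assumption -->
- <fieldtype name="text_porter" class="solr.TextField">
-   <analyzer>
-     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
-     <filter class="solr.LowerCaseFilterFactory"/>
-     <filter class="solr.PorterStemFilterFactory"/>
-   </analyzer>
- </fieldtype>
- }}}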
- 
- <<Anchor(EnglishPorterFilter)>>
- ==== solr.EnglishPorterFilterFactory ====
- 
- Creates `solr.EnglishPorterFilter`.
- 
- This filter is an [[http://snowball.tartarus.org/algorithms/english/stemmer.html|English Porter2 stemmer]] built from the Java classes generated from a [[http://snowball.tartarus.org/|Snowball]] specification.
- 
- A customized protected word list may be specified with the "protected" attribute in the
schema. Any words in the protected word list will not be modified by the stemmer.
- 
- A [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/protwords.txt|sample
Solr protwords.txt with comments]] can be found in the Source Repository.
- 
- {{{
- <fieldtype name="myfieldtype" class="solr.TextField">
-   <analyzer>
-     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
-     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
-   </analyzer>
- </fieldtype>
- }}}
- 
- 
- <<Anchor(SnowballPorterFilter)>>
- ==== solr.SnowballPorterFilterFactory ====
- 
- Creates `org.apache.lucene.analysis.snowball.SnowballFilter`.
- 
- Creates a [[http://snowball.tartarus.org/texts/stemmersoverview.html|Snowball stemmer]] from the Java classes generated from a [[http://snowball.tartarus.org/|Snowball]] specification. The language attribute specifies the language of the stemmer.
- {{{
- <fieldtype name="myfieldtype" class="solr.TextField">
-   <analyzer>
-     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
-     <filter class="solr.SnowballPorterFilterFactory" language="German" />
-   </analyzer>
- </fieldtype>
- }}}
- 
- Valid values for the language attribute (the factory instantiates the Snowball stemmer class named language + "Stemmer", for example `GermanStemmer`):
-  * [[http://snowball.tartarus.org/algorithms/danish/stemmer.html|Danish]]
-  * [[http://snowball.tartarus.org/algorithms/dutch/stemmer.html|Dutch]]
-  * [[http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html|Kp]]: The Kraaij-Pohlmann
stemming algorithm for Dutch.
-  * [[http://snowball.tartarus.org/algorithms/porter/stemmer.html|Porter]]: The original
Porter stemming algorithm for English.
-  * [[http://snowball.tartarus.org/algorithms/english/stemmer.html|English]]: The Porter2
stemming algorithm for English.
-  * [[http://snowball.tartarus.org/algorithms/lovins/stemmer.html|Lovins]]: The early Lovins
stemming algorithm for English.
-  * [[http://snowball.tartarus.org/algorithms/finnish/stemmer.html|Finnish]]
-  * [[http://snowball.tartarus.org/algorithms/french/stemmer.html|French]]
-  * [[http://snowball.tartarus.org/algorithms/german/stemmer.html|German]]
-  * [[http://snowball.tartarus.org/algorithms/german2/stemmer.html|German2]]: A variation of the German algorithm with handling to allow ä, ö and ü to be represented by ae, oe and ue.
-  * [[http://snowball.tartarus.org/algorithms/italian/stemmer.html|Italian]]
-  * [[http://snowball.tartarus.org/algorithms/norwegian/stemmer.html|Norwegian]]
-  * [[http://snowball.tartarus.org/algorithms/portuguese/stemmer.html|Portuguese]]
-  * [[http://snowball.tartarus.org/algorithms/romanian/stemmer.html|Romanian]]
-  * [[http://snowball.tartarus.org/algorithms/russian/stemmer.html|Russian]]
-  * [[http://snowball.tartarus.org/algorithms/spanish/stemmer.html|Spanish]]
-  * [[http://snowball.tartarus.org/algorithms/swedish/stemmer.html|Swedish]]
-  * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]
- 
- <!> Gotchas:
-  * Although the Lovins stemmer is described as faster than Porter/Porter2, in practice it is much slower in Solr, because it is implemented using reflection.
-  * Neither the Lovins nor the Finnish stemmer produces correct output (as of Solr 1.4), due to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]].
-  * The Turkish stemmer expects properly lowercased terms for correct output, but `LowerCaseFilterFactory` does not lowercase Turkish correctly. See [[https://issues.apache.org/jira/browse/LUCENE-2102|LUCENE-2102]] and [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
-  * The stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results. For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match `resumed`, `resuming`, etc. The differences can be more profound for non-English stemmers.
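- 
- One way to act on the diacritics caveat above is to fold accents only after stemming, so the stemmer still sees the original characters. The following is a sketch under assumptions: the field type name `text_stem_then_fold` is hypothetical, `French` is just an illustrative language choice, and whether folding after stemming suits a given language should be checked against real data.
- 
- {{{
- <!-- hypothetical field type; the filter order is the point of the example -->
- <fieldtype name="text_stem_then_fold" class="solr.TextField">
-   <analyzer>
-     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
-     <filter class="solr.LowerCaseFilterFactory"/>
-     <!-- stem first, while diacritics are still present -->
-     <filter class="solr.SnowballPorterFilterFactory" language="French" />
-     <!-- fold accents afterwards -->
-     <filter class="solr.ASCIIFoldingFilterFactory"/>
-   </analyzer>
- </fieldtype>
- }}}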
- 
  
  <<Anchor(WordDelimiterFilter)>>
  ==== solr.WordDelimiterFilterFactory ====
