lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "LanguageAnalysis" by JanHoydahl
Date Sat, 24 Sep 2011 19:46:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageAnalysis" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=14&rev2=15

  
  There is no general rule for whether or not to stem: It depends not only on the language,
but also on the properties of your documents and queries.
  
- The snowball stemmers are considered fairly aggressive, but for many languages (see above)
Solr provides alternatives that are less aggressive. In many situations a lighter approach
yields better relevance: often "less is more". The light stemmers typically target the most
common noun/adjective inflections, and perhaps a few derivational suffixes. The minimal stemmers
are even more conservative and may only remove plural endings.
+ The [[Hunspell]] stemmers are both dictionary and rule based and are thus fairly accurate
for most languages. The Snowball stemmers rely on algorithms and are considered fairly aggressive,
but for many languages (see above) Solr provides alternatives that are less aggressive. In
many situations a lighter approach yields better relevance: often "less is more". The light
stemmers typically target the most common noun/adjective inflections, and perhaps a few derivational
suffixes. The minimal stemmers are even more conservative and may only remove plural endings.
  
- In general, if the language is highly inflectional, its worth evaluating at least a light
stemmer as it might bring a significant improvement. Some annoyances caused by stemming can
then be handled with tuning: See {{{CustomizingStemming}}} below.
+ In general, if the language is highly inflectional, it's worth evaluating at least a light
stemmer as it might bring a significant improvement. Or you may consider [[Hunspell]], which
combines advanced rules with dictionaries of legal stems. Some annoyances caused by stemming
can then be handled with tuning: See {{{CustomizingStemming}}} below.
+ 
+ ==== Notes about solr.HunspellStemFilterFactory ====
+ <!> [[Solr3.5]] The Hunspell stemmers are configured through the HunspellStemFilterFactory
combined with a dictionary and an affix file. Hunspell supports 99 languages.
+ 
+ If you are currently using a Snowball stemmer, you should almost certainly switch to Hunspell
for its increased precision and less aggressive stemming. See examples [[HunspellStemFilterFactory|here]].
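
For illustration, a minimal schema.xml sketch of a field type that uses the Hunspell filter with a dictionary and an affix file. The field type name and the file names `en_GB.dic` and `en_GB.aff` are assumptions; substitute the files for your own language.

```xml
<!-- Hypothetical field type name; the .dic/.aff file names are placeholders -->
<fieldType name="text_hunspell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Dictionary and affix file are looked up in the core's conf directory -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_GB.dic"
            affix="en_GB.aff"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```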
  
  ==== Notes about solr.PorterStemFilterFactory ====
  
@@ -500, +505 @@

  Note: This differs very slightly from the "Porter" algorithm available in `solr.SnowballPorterFilter`,
in that it deviates slightly from the published algorithm.
  For more details, see the section "Points of difference from the published algorithm" described
[[http://tartarus.org/~martin/PorterStemmer/|here]].
  
- This is the fastest stemmer for English: approximately twice as fast as using SnowballPorterFilterFactory.
+ PorterStemFilterFactory is approximately twice as fast as SnowballPorterFilterFactory.
+ 
+ ==== Notes about solr.KStemFilterFactory ====
+ <!> [[Solr3.3]] [[AnalyzersTokenizersTokenFilters/Kstem|KStem]] is an English language
stemmer which is similar to Porter but less aggressive, and thus often preferred.
+ 
+ KStem is considerably faster than SnowballPorterFilterFactory.
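
For illustration, a minimal schema.xml sketch of an English field type that uses KStem. The field type name is an assumption. The lowercase filter is placed before the stemmer because KStem expects lowercased input.

```xml
<!-- Hypothetical field type name, shown only as a sketch -->
<fieldType name="text_en_kstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first so the stemmer sees lowercased tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>
```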
  
  <<Anchor(SnowballPorterFilter)>>
  ==== Notes about solr.SnowballPorterFilterFactory ====
