lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "LanguageAnalysis" by JanHoydahl
Date Sun, 25 Sep 2011 17:31:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageAnalysis" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=15&rev2=16

  
  There is no general rule for whether or not to stem: it depends not only on the language,
but also on the properties of your documents and queries.
  
+ Lucene/Solr provides several stemmers, and for some languages you can choose among more than one.
Some are algorithm-based, others are dictionary-based.
+ 
- The [[Hunspell]] stemmers are both dictionary and rule based and are thus fairly accurate
for most languages. The Snowball stemmers rely on algorithms and considered fairly aggressive,
but for many languages (see above) Solr provides alternatives that are less aggressive. In
many situations a lighter approach yields better relevance: often "less is more". The light
stemmers typically target the most common noun/adjective inflections, and perhaps a few derivational
suffixes. The minimal stemmers are even more conservative and may only remove plural endings.
+ The Snowball stemmers rely on algorithms and are considered fairly aggressive, but for many
languages (see above) Solr provides alternatives that are less aggressive. In many situations
a lighter approach yields better relevance: often "less is more". The light stemmers typically
target the most common noun/adjective inflections, and perhaps a few derivational suffixes.
The minimal stemmers are even more conservative and may only remove plural endings. The new
Hunspell stemmers are both dictionary- and rule-based and may provide tighter stemming than
Snowball for some languages.
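  As a rough illustration of the range described above, the three filters below stem German
at decreasing levels of aggressiveness. German is chosen only as an example language; normally
just one of these filters would appear in a given analyzer chain, and which one works best
depends on your documents and queries.

{{{
<!-- Snowball: algorithmic, the most aggressive of the three -->
<filter class="solr.SnowballPorterFilterFactory" language="German"/>

<!-- Light: targets the most common noun/adjective inflections -->
<filter class="solr.GermanLightStemFilterFactory"/>

<!-- Minimal: the most conservative of the three -->
<filter class="solr.GermanMinimalStemFilterFactory"/>
}}}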
  
  In general, if the language is highly inflectional, it's worth evaluating at least a light
stemmer, as it might bring a significant improvement. Or you may consider [[Hunspell]], which
has advanced rules combined with dictionaries of legal stems. Some annoyances caused by stemming
can then be handled with tuning: see {{{CustomizingStemming}}} below.
+ 
+ <!> NOTE: If stemming does not give enough precision for your requirements, you may
consider [[http://en.wikipedia.org/wiki/Lemmatisation|lemmatization]]. No lemmatizers are
included with Solr, but both commercial and open-source lemmatizers exist.
  
  ==== Notes about solr.HunspellStemFilterFactory ====
  <!> [[Solr3.5]] The Hunspell stemmers are configured through the HunspellStemFilterFactory
combined with a dictionary and an affix file. Hunspell supports 99 languages.
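  A minimal sketch of such a configuration follows. The field type name, the British English
file names (en_GB.dic, en_GB.aff) and the surrounding tokenizer and lowercase filter are
assumptions made for illustration; substitute the dictionary and affix files for your own language.

{{{
<!-- "text_hunspell" is a hypothetical field type name -->
<fieldType name="text_hunspell" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- dictionary and affix name the Hunspell .dic and .aff files -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true"/>
  </analyzer>
</fieldType>
}}}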
- 
- If you are currently using Snowball stemmer, you should almost certainly switch to Hunspell
due to increased precission and less agressive stemming. See examples [[HunspellStemFilterFactory|here]]
  
  ==== Notes about solr.PorterStemFilterFactory ====
  
