lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "LanguageAnalysis" by RobertMuir
Date Wed, 14 Jul 2010 13:48:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageAnalysis" page has been changed by RobertMuir.
The comment on this change is: docs for new stem factories.
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=3&rev2=4

--------------------------------------------------

  <!> Note: See also {{{Decompounding}}} below.
  
  === English ===
- Solr includes two stemmers for English, the original Porter stemmer via {{{solr.PorterStemFilterFactory}}},
and the Porter2 stemmer via {{{solr.SnowballPorterFilterFactory}}}, as well as an example
stopword list.
+ Solr includes three stemmers for English: the original Porter stemmer via {{{solr.PorterStemFilterFactory}}},
the Porter2 stemmer via {{{solr.SnowballPorterFilterFactory}}}, and a plural-only stemmer
<!> [[Solr3.1]] via {{{solr.EnglishMinimalStemFilterFactory}}}. Lucene includes an example
stopword list from the snowball project.
  
  {{{
  ...
@@ -120, +120 @@

  [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]
  
  === Finnish ===
- Solr includes support for stemming Finnish via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
+ Solr includes two stemmers for Finnish: one via {{{solr.SnowballPorterFilterFactory}}},
and an alternative stemmer <!> [[Solr3.1]] via {{{solr.FinnishLightStemFilterFactory}}}.
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -130, +130 @@

  }}}
  
  Example set of Finnish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
+ 
  <!> Note: See also {{{Decompounding}}} below.
+ 
+ <!> Note: The Snowball stemmer for Finnish has known bugs, due to a bug in [[http://article.gmane.org/gmane.comp.search.snowball/1139|snowball
itself]]
  
  === French ===
- Solr includes support for stemming French via {{{solr.SnowballPorterFilterFactory}}}, removing
elisions via ElisionFilterFactory, and Lucene includes an example stopword list.
+ Solr includes three stemmers for French: one via {{{solr.SnowballPorterFilterFactory}}},
an alternative stemmer <!> [[Solr3.1]] via {{{solr.FrenchLightStemFilterFactory}}},
and an even less aggressive approach <!> [[Solr3.1]] via {{{solr.FrenchMinimalStemFilterFactory}}}.
Solr can also remove elisions via {{{solr.ElisionFilterFactory}}}, and Lucene includes an
example stopword list.
  
  {{{
  ...
@@ -149, +152 @@

  <!> Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This
will prevent very slow phrase queries.
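
The filter ordering in the note above can be sketched as a schema.xml analyzer chain. This is only an illustration: the field type name and the whitespace tokenizer are assumptions, and the elision filter is shown without any attributes.

{{{
<!-- Hypothetical field type; the name and tokenizer are illustrative assumptions -->
<fieldType name="text_fr_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Remove elisions first, so that l'avion is reduced to avion ... -->
    <filter class="solr.ElisionFilterFactory"/>
    <!-- ... before WordDelimiterFilter has a chance to split on the apostrophe -->
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
}}}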
  
  === German ===
- Solr includes support for stemming German with three different algorithms: two via {{{solr.SnowballPorterFilterFactory}}},
and one via {{{solr.GermanStemFilterFactory}}}, and Lucene includes an example stopword list.
+ Solr includes support for stemming German with five different algorithms: two via {{{solr.SnowballPorterFilterFactory}}},
one via {{{solr.GermanStemFilterFactory}}}, a lightweight stemmer <!> [[Solr3.1]] via
{{{solr.GermanLightStemFilterFactory}}}, and an even less aggressive approach <!> [[Solr3.1]]
via {{{solr.GermanMinimalStemFilterFactory}}}. Lucene includes an example stopword list.
  
  With the {{{solr.SnowballPorterFilterFactory}}} you can supply two different language attributes:
"German" and "German2". German2 is just a modified version of German that handles the umlaut
characters differently: for example it treats "ü" as "ue" in most contexts.
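
As a minimal sketch of the language attribute described above, a filter entry selecting the German2 variant might look like the following. The surrounding tokenizer and lowercase filter are illustrative assumptions, not something this change prescribes.

{{{
<analyzer>
  <!-- Tokenizer and lowercasing are assumed here for illustration -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- language="German" selects the other Snowball variant -->
  <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
</analyzer>
}}}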
  
@@ -197, +200 @@

  
  === Hungarian ===
  
- Solr includes support for stemming Hungarian via {{{solr.SnowballPorterFilterFactory}}},
and Lucene includes an example stopword list.
+ Solr includes two stemmers for Hungarian: one via {{{solr.SnowballPorterFilterFactory}}},
and an alternative stemmer <!> [[Solr3.1]] via {{{solr.HungarianLightStemFilterFactory}}}.
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -227, +230 @@

  Example set of Indonesian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]
  
  === Italian ===
- Solr includes support for stemming Italian via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
+ Solr includes two stemmers for Italian: one via {{{solr.SnowballPorterFilterFactory}}},
and an alternative stemmer <!> [[Solr3.1]] via {{{solr.ItalianLightStemFilterFactory}}}.
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -267, +270 @@

  <!> Note: WordDelimiterFilter does not split on joiners by default. You can solve
this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory
which supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider
using PositionFilter at query-time (only), as the QueryParser does not consider joiners and
could create unwanted phrase queries.
  
  === Portuguese ===
- Solr includes support for stemming Portuguese via {{{solr.SnowballPorterFilterFactory}}},
and Lucene includes an example stopword list.
+ Solr includes three stemmers for Portuguese: one via {{{solr.SnowballPorterFilterFactory}}},
an alternative stemmer <!> [[Solr3.1]] via {{{solr.PortugueseLightStemFilterFactory}}},
and an even less aggressive approach <!> [[Solr3.1]] via {{{solr.PortugueseMinimalStemFilterFactory}}}.
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -291, +294 @@

  Example set of Romanian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Russian ===
- Solr includes support for stemming Russian via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
+ Solr includes two stemmers for Russian: one via {{{solr.SnowballPorterFilterFactory}}},
and an alternative stemmer <!> [[Solr3.1]] via {{{solr.RussianLightStemFilterFactory}}}.
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -303, +306 @@

  Example set of Russian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Spanish ===
- Solr includes support for stemming Spanish via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
+ Solr includes two stemmers for Spanish: one via {{{solr.SnowballPorterFilterFactory}}},
and an alternative stemmer <!> [[Solr3.1]] via {{{solr.SpanishLightStemFilterFactory}}}.
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -315, +318 @@

  Example set of Spanish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Swedish ===
- Solr includes support for stemming Swedish via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
+ Solr includes two stemmers for Swedish: one via {{{solr.SnowballPorterFilterFactory}}},
and an alternative stemmer <!> [[Solr3.1]] via {{{solr.SwedishLightStemFilterFactory}}}.
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -428, +431 @@

  
  There is no general rule for whether or not to stem: It depends not only on the language,
but also on the properties of your documents and queries.
  
+ The snowball stemmers are considered fairly aggressive, but for many languages (see above)
Solr provides alternatives that are less aggressive. In many situations a lighter approach
yields better relevance: often "less is more". The light stemmers typically target the most
common noun/adjective inflections, and perhaps a few derivational suffixes. The minimal stemmers
are even more conservative and may only remove plural endings.
+ 
- In general, if the language is highly inflectional, it's worth evaluating as it might bring
a significant improvement. Some annoyances caused by stemming can then be handled with tuning:
See {{{CustomizingStemming}}} below.
+ In general, if the language is highly inflectional, it's worth evaluating at least a light
stemmer as it might bring a significant improvement. Some annoyances caused by stemming can
then be handled with tuning: See {{{CustomizingStemming}}} below.
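
For comparison, trying a light stemmer in place of the Snowball one could be sketched as below, using Spanish as the example. The field type name, the stopword file name, and the choice of tokenizer are hypothetical; only the two stem filter factories come from the text above, and the light stemmer requires Solr 3.1.

{{{
<!-- Hypothetical field type name and stopword file, for illustration only -->
<fieldType name="text_es_light" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_es.txt"/>
    <!-- Less aggressive alternative to the Snowball stemmer -->
    <filter class="solr.SpanishLightStemFilterFactory"/>
    <!-- To compare against the more aggressive stemmer, swap in:
         <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/> -->
  </analyzer>
</fieldType>
}}}

Indexing the same documents with each variant and comparing results on representative queries is one way to judge which level of stemming suits a given collection.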
  
  ==== Notes about solr.PorterStemFilterFactory ====
  
