lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by RobertMuir
Date Fri, 05 Feb 2010 15:47:57 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by RobertMuir.
The comment on this change is: add romanian/turkish, with turkish gotcha, and provide an example
for diacritics.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=75&rev2=76

--------------------------------------------------

   * [[http://snowball.tartarus.org/algorithms/italian/stemmer.html|Italian]]
   * [[http://snowball.tartarus.org/algorithms/norwegian/stemmer.html|Norwegian]]
   * [[http://snowball.tartarus.org/algorithms/portuguese/stemmer.html|Portuguese]]
+  * [[http://snowball.tartarus.org/algorithms/romanian/stemmer.html|Romanian]]
   * [[http://snowball.tartarus.org/algorithms/russian/stemmer.html|Russian]]
   * [[http://snowball.tartarus.org/algorithms/spanish/stemmer.html|Spanish]]
   * [[http://snowball.tartarus.org/algorithms/swedish/stemmer.html|Swedish]]
+  * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]
  
  <!> Gotchas:
   * Although the Lovins stemmer is described as faster than Porter/Porter2, in practice it
is much slower in Solr, as it is implemented using reflection.
   * Neither the Lovins nor the Finnish stemmer produces correct output (as of Solr 1.4), due
to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]].
-  * The Non-English stemmers are sensitive to diacritics. Think carefully before removing
these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause
unwanted results.
+  * The Turkish stemmer expects properly lowercased terms for correct output, but `LowerCaseFilterFactory`
does not lowercase Turkish correctly. See [[https://issues.apache.org/jira/browse/LUCENE-2102|LUCENE-2102]]
and [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
+  * The stemmers are sensitive to diacritics. Think carefully before removing these with
something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results.
For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed
to `resum`, causing it to match with `resumed`, `resuming`, etc. The differences can be more
profound for non-English stemmers.
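
The Turkish lowercasing gotcha above can be illustrated outside Solr with a short Python sketch. This is not Solr or Lucene code: Python's `str.lower()` is used here only as a stand-in for any locale-independent lowercasing, and `naive_lower` and `turkish_lower` are hypothetical helper names introduced for this example. The sketch assumes that handling the two capital I's (`I` and `İ`) is enough to show the problem; it is not a complete Turkish case mapping.

```python
def naive_lower(term: str) -> str:
    """Locale-independent lowercasing (what Python's str.lower() does)."""
    return term.lower()


def turkish_lower(term: str) -> str:
    """Hypothetical helper: map the two Turkish capital I's first,
    then lowercase everything else the usual way."""
    # Dotted capital İ (U+0130) becomes plain i; dotless capital I becomes ı (U+0131).
    return term.replace("İ", "i").replace("I", "ı").lower()


word = "IŞIK"  # Turkish for "light", written with dotless I
print(naive_lower(word))    # dotted i's: not the Turkish lowercase form
print(turkish_lower(word))  # dotless ı's: the form a Turkish stemmer expects
```

A stemmer that expects `ışık` but receives `işik` sees a different term, which is why the lowercasing step matters before Turkish stemming.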
+ 
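
The diacritics gotcha can be sketched in a similar way. The snippet below is only an approximation of what a folding filter such as `ASCIIFoldingFilterFactory` does for accented Latin letters: `fold_diacritics` is a hypothetical helper that decomposes each character and drops the combining marks. It does not run a stemmer; it only shows that folding makes `résumé` identical to `resume` before any stemming happens.

```python
import unicodedata


def fold_diacritics(term: str) -> str:
    """Hypothetical stand-in for ASCII folding: decompose to NFD,
    then drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


original = "résumé"
folded = fold_diacritics(original)
print(folded)                # the accents are gone
print(folded == "resume")    # the folded noun now equals the verb
```

Once the two terms are identical, the stemmer can no longer tell them apart, so the folded noun is stemmed like the verb and matches `resumed`, `resuming`, and so on, as described above.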
  
  <<Anchor(WordDelimiterFilter)>>
  ==== solr.WordDelimiterFilterFactory ====
