lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Trivial Update of "LanguageAnalysis" by iorixxx
Date Mon, 28 May 2012 11:39:53 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageAnalysis" page has been changed by iorixxx:
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=24&rev2=25

Comment:
Turkish stopwords URL was corrected

  = Language Analysis =
- 
  This page describes some of the language-specific analysis components available in Solr.
These components can be used to improve search results for specific languages.
  
- Please look at [[AnalyzersTokenizersTokenFilters|AnalyzersTokenizersTokenFilters]] for other
analysis components you can use in combination with these components.
+ Please look at AnalyzersTokenizersTokenFilters for other analysis components you can use
in combination with these components.
  
- NOTE: This page is mostly '''obsolete'''. The [[http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr/example/solr/conf/schema.xml|Solr
Example]] now contains
+ NOTE: This page is mostly '''obsolete'''. The [[http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr/example/solr/conf/schema.xml|Solr
Example]] now contains configurations for various languages as fieldTypes (text_XX). This
is synchronized with the support from Lucene.
- configurations for various languages as fieldTypes (text_XX). This is synchronized with
the support from Lucene.
  
  <<TableOfContents>>
  
  == By language ==
- 
  === Arabic ===
  Solr provides support for the [[http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf|Light-10]]
stemming algorithm, and Lucene includes an example stopword list.
  
@@ -24, +21 @@

    <filter class="solr.ArabicStemFilterFactory"/>
  ...
  }}}
- 
  Example set of Arabic [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Armenian ===
@@ -38, +34 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Armenian" />
  ...
  }}}
- 
  Example set of Armenian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Basque ===
@@ -52, +47 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Basque" />
  ...
  }}}
- 
  Example set of Basque [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Brazilian Portuguese ===
@@ -62, +56 @@

  ...
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
- ... 
+ ...
  }}}
- 
  Example set of Brazilian Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Bulgarian ===
@@ -78, +71 @@

    <filter class="solr.BulgarianStemFilterFactory"/>
  ...
  }}}
- 
  Example set of Bulgarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Catalan ===
@@ -92, +84 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Catalan" />
  ...
  }}}
- 
  Example set of Catalan [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Chinese, Japanese, Korean ===
@@ -102, +93 @@

     <tokenizer class="solr.CJKTokenizerFactory"/>
  ...
  }}}
- 
  <!> [[Solr3.1]] Alternatively, for Simplified Chinese, Solr provides support for Chinese
word segmentation {{{solr.SmartChineseWordTokenFilterFactory}}} in the analysis-extras contrib
module. This component includes a large dictionary and segments Chinese text into words with
the Hidden Markov Model. To use this filter, see solr/contrib/analysis-extras/README.txt for
instructions on which jars you need to add to your SOLR_HOME/lib
  
  To use the default setup, which falls back to the English Porter stemmer for English words, use:
+ 
  {{{
     <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
  }}}
- 
  Or to configure your own analysis setup, use the SmartChineseSentenceTokenizerFactory along
with your custom filter setup. The sentence tokenizer tokenizes on sentence boundaries and
the SmartChineseWordTokenFilter breaks this further up into words.
+ 
  {{{
    <analyzer>
      <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
@@ -119, +110 @@

      <filter class="solr.PositionFilterFactory" />
    </analyzer>
  }}}
- 
- <!> Note: Be sure to use [[AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory|PositionFilter]]
at query-time (only) as these languages do not use spaces between words. 
+ <!> Note: Be sure to use [[AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory|PositionFilter]]
at query-time (only) as these languages do not use spaces between words.
  
  === Czech ===
  <!> [[Solr3.1]]
@@ -133, +123 @@

    <filter class="solr.CzechStemFilterFactory"/>
  ...
  }}}
- 
  Example set of Czech [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/cz/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Danish ===
@@ -145, +134 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Danish" />
  ...
  }}}
- 
  Example set of Danish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/danish_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: See also {{{Decompounding}}} below.
@@ -159, +147 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" />
  ...
  }}}
- 
  An alternative stemmer (Kraaij-Pohlmann) can be used by specifying the language as "Kp".
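  
  A minimal sketch of selecting that alternative stemmer. Only the language value differs from the Snowball example above; the rest of the analyzer chain is assumed to stay unchanged:
  
  {{{
  ...
    <!-- "Kp" selects the Kraaij-Pohlmann stemmer instead of the default Dutch Snowball stemmer -->
    <filter class="solr.SnowballPorterFilterFactory" language="Kp" />
  ...
  }}}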
  
  Example set of Dutch [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/dutch_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
@@ -175, +162 @@

    <filter class="solr.PorterStemFilterFactory"/>
  ...
  }}}
- 
  <!> Note: The standard {{{PorterStemFilterFactory}}} is significantly faster than
{{{solr.SnowballPorterFilterFactory}}}.
  
- Larger example set English 
- [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]
+ Larger example set English  [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]
  
  === Finnish ===
  Solr includes two stemmers for Finnish: one via {{{solr.SnowballPorterFilterFactory}}},
and an alternative stemmer <!> [[Solr3.1]] via {{{solr.FinnishLightStemFilterFactory}}}.
Lucene includes an example stopword list.
@@ -190, +175 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Finnish" />
  ...
  }}}
- 
  Example set of Finnish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: See also {{{Decompounding}}} below.
@@ -208, +192 @@

    <filter class="solr.SnowballPorterFilterFactory" language="French" />
  ...
  }}}
- 
  Example set of French [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This
will prevent very slow phrase queries.
@@ -224, +207 @@

    <filter class="solr.GalicianStemFilterFactory"/>
  ...
  }}}
- 
  Example set of Galician [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/gl/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === German ===
@@ -240, +222 @@

    <filter class="solr.SnowballPorterFilterFactory" language="German2" />
  ...
  }}}
- 
  Example set of German [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: See also {{{Decompounding}}} below.
@@ -254, +235 @@

    <filter class="solr.GreekStemFilterFactory"/>
  ...
  }}}
- 
  Example set of Greek [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/el/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: Be sure to use the Greek-specific GreekLowerCaseFilterFactory
+ 
  === Hebrew ===
- 
  {{{
  ...
    <tokenizer class="solr.ICUTokenizerFactory"/>
  ...
  }}}
  Example set of Hebrew [[http://wiki.korotkin.co.il/Hebrew_stopwords|stopwords]] (Be sure
to switch your browser encoding to UTF-8)
+ 
  === Hindi ===
  <!> [[Solr3.1]]
  
@@ -278, +259 @@

    <filter class="solr.HindiStemFilterFactory"/>
  ...
  }}}
- 
  Example set of Hindi [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Hungarian ===
- 
  Solr includes two stemmers for Hungarian: one via {{{solr.SnowballPorterFilterFactory}}},
and an alternative stemmer <!> [[Solr3.1]] via {{{solr.HungarianLightStemFilterFactory}}}.
Lucene includes an example stopword list.
  
  {{{
@@ -291, +270 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
  ...
  }}}
- 
  Example set of Hungarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/hungarian_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: See also {{{Decompounding}}} below.
@@ -309, +287 @@

    <filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
  ...
  }}}
- 
  Example set of Indonesian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]
  
  === Italian ===
@@ -321, +298 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Italian" />
  ...
  }}}
- 
  Example set of Italian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Lao, Myanmar, Khmer ===
@@ -329, +305 @@

  
  Lucene provides support for segmenting these languages into syllables with {{{solr.ICUTokenizerFactory}}}
in the analysis-extras contrib module. To use this tokenizer, see solr/contrib/analysis-extras/README.txt
for instructions on which jars you need to add to your SOLR_HOME/lib
  
- <!> Note: Be sure to use PositionFilter at query-time (only) as these languages do
not use spaces between words. 
+ <!> Note: Be sure to use PositionFilter at query-time (only) as these languages do
not use spaces between words.
  
  === Norwegian ===
  Solr includes support for stemming Norwegian via {{{solr.SnowballPorterFilterFactory}}},
and Lucene includes an example stopword list. Since <!> [[Solr3.6]] you can also use
{{{solr.NorwegianLightStemFilterFactory}}} for a lighter variant or {{{solr.NorwegianMinimalStemFilterFactory}}}
attempting to normalize plural endings only. These two are simple rule-based stemmers that do not
handle exceptions or irregular forms.
@@ -340, +316 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Norwegian" />
  ...
  }}}
- 
  Example set of Norwegian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/norwegian_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: See also {{{Decompounding}}} below.
@@ -354, +329 @@

    <filter class="solr.PersianNormalizationFilterFactory"/>
  ...
  }}}
- 
  Example set of Persian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: WordDelimiterFilter does not split on joiners by default. You can solve
this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory
which supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider
using PositionFilter at query-time (only), as the QueryParser does not consider joiners and
could create unwanted phrase queries.
@@ -362, +336 @@

  === Polish ===
  <!> [[Solr3.1]]
  
+ Lucene provides support for Polish stemming {{{solr.StempelPolishStemFilterFactory}}} in
the analysis-extras contrib module. This component includes an algorithmic stemmer with tables
for Polish. To use this filter, see solr/contrib/analysis-extras/README.txt for instructions
on which jars you need to add to your SOLR_HOME/lib
- Lucene provides support for Polish stemming {{{solr.StempelPolishStemFilterFactory}}} in
the analysis-extras contrib module. This component includes an algorithmic stemmer with tables
for Polish.
- To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which
jars you need to add to your SOLR_HOME/lib
  
  {{{
  ...
@@ -371, +344 @@

    <filter class="solr.StempelPolishStemFilterFactory"/>
  ...
  }}}
- 
  Example set of Polish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/stempel/src/resources/org/apache/lucene/analysis/pl/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Portuguese ===
@@ -383, +355 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Portuguese" />
  ...
  }}}
- 
  Example set of Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Romanian ===
@@ -395, +366 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Romanian" />
  ...
  }}}
- 
  Example set of Romanian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Russian ===
@@ -407, +377 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Russian" />
  ...
  }}}
- 
  Example set of Russian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Spanish ===
@@ -419, +388 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
  ...
  }}}
- 
  Example set of Spanish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Swedish ===
@@ -431, +399 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Swedish" />
  ...
  }}}
- 
  Example set of Swedish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/swedish_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: See also {{{Decompounding}}} below.
@@ -444, +411 @@

    <filter class="solr.ThaiWordFilterFactory"/>
  ...
  }}}
- 
  <!> Note: Be sure to use PositionFilter at query-time (only) as this language does
not use spaces between words.
  
  === Turkish ===
@@ -456, +422 @@

    <filter class="solr.SnowballPorterFilterFactory" language="Turkish" />
  ...
  }}}
- 
- Example set of Turkish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
+ Example set of Turkish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/lang/stopwords_tr.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory <!>
[[Solr3.1]]
  
  == My language is not listed!!! ==
- 
  Your language might work anyway. A first step is to start with the "textgen" type in the
example schema. Remember, things like stemming and stopwords aren't mandatory for the search
to work, only optional language-specific improvements.
  
  If you have problems (your language is highly-inflectional, etc), you might want to try
using an n-gram approach as an alternative.
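  
  As a rough sketch of the n-gram approach, a field type along these lines could be a starting point. The type name {{{text_ngram}}} and the gram sizes are illustrative assumptions, not values taken from the example schema:
  
  {{{
  <fieldtype name="text_ngram" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- gram sizes are assumptions; tune them for your language and index size -->
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
    </analyzer>
  </fieldtype>
  }}}
  
  A wider range of gram sizes increases the index size, so expect to experiment with these values.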
  
  == Other Tips ==
  === Tokenization ===
- 
  In general most languages don't require special tokenization (and will work just fine with
Whitespace + WordDelimiterFilter), so you can safely tailor the English "text" example schema
definition to fit.
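  
  A minimal sketch of such a Whitespace + WordDelimiterFilter chain follows. The type name {{{text_generic}}} is hypothetical and the attribute values are only one plausible starting point:
  
  {{{
  <fieldtype name="text_generic" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- splits tokens on intra-word delimiters such as hyphens -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldtype>
  }}}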
  
  === Ignoring Case ===
- 
- In most cases LowerCaseFilterFactory is sufficient. 
- However, some languages have special casing properties, and these have their own filters:
+ In most cases LowerCaseFilterFactory is sufficient.  However, some languages have special
casing properties, and these have their own filters:
  
   * TurkishLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Turkish
language. It includes special handling for [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|dotted
and dotless I]].
   * GreekLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Greek
language. It removes Greek diacritics and has special handling for the Greek final sigma.
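  
  For instance, a Turkish chain might swap in the language-specific filter as sketched here. The choice of tokenizer is an assumption; the point is only that the Turkish lowercase filter replaces LowerCaseFilterFactory and runs before the stemmer:
  
  {{{
  ...
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- used instead of solr.LowerCaseFilterFactory, and placed before stemming -->
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Turkish" />
  ...
  }}}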
  
  === Ignoring Diacritics ===
- 
  Some languages use diacritics, but people are not always consistent about typing them in
queries or documents.
  
  If you are using a stemmer, most stemmers (especially Snowball stemmers) are a bit forgiving
about diacritics, and these are handled on a language-specific basis.
@@ -493, +453 @@

  For other languages, the ASCIIFoldingFilterFactory won't do the foldings that you need.
One solution is to use {{{solr.analysis.ICUFoldingFilterFactory}}} <!> [[Solr3.1]],
which implements a [[http://unicode.org/reports/tr30/tr30-4.html|similar idea]] across all
of Unicode.
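  
  A minimal sketch of a chain using ICU folding is shown below. It assumes the ICU jars from the analysis-extras contrib module are on the classpath, uses the short class name {{{solr.ICUFoldingFilterFactory}}}, and picks the tokenizer only for illustration:
  
  {{{
  ...
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ICU folding also case-folds, so no separate LowerCaseFilter is assumed here -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  ...
  }}}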
  
  === Stopwords ===
- 
  Stopwords affect Solr in three ways: relevance, performance, and resource utilization.
  
  From a relevance perspective, these extremely high-frequency terms tend to throw off the
scoring algorithm, and you won't get very good results if you leave them. At the same time,
if you remove them, you can return bad results when the stopword is actually important.
@@ -505, +464 @@

  One tradeoff you can make if you have the disk space: You can use CommonGramsFilter/CommonGramsQueryFilter
instead of StopFilter. This solves the relevance and performance problems, at the expense
of even more resource utilization, because it will form bigrams of stopwords to their adjacent
words.
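  
  A sketch of that tradeoff, with the CommonGrams filters in place of StopFilter. The type name {{{text_commongrams}}} and the file name {{{stopwords.txt}}} are assumptions for illustration:
  
  {{{
  <fieldtype name="text_commongrams" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- index time: emits bigrams of stopwords with their adjacent words -->
      <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- query time: the query-side counterpart, using the same word list -->
      <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldtype>
  }}}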
  
  === Stemming ===
- 
  Stemming can help improve relevance, but it can also hurt.
  
  There is no general rule for whether or not to stem: It depends not only on the language,
but also on the properties of your documents and queries.
  
- Lucene/Solr provides different stemmers, and for some languages you may have multiple choices.
Some are algorithmic based, others are dictionary based. 
+ Lucene/Solr provides different stemmers, and for some languages you may have multiple choices.
Some are algorithmic based, others are dictionary based.
  
  The Snowball stemmers rely on algorithms and are considered fairly aggressive, but for many
languages (see above) Solr provides alternatives that are less aggressive. In many situations
a lighter approach yields better relevance: often "less is more". The light stemmers typically
target the most common noun/adjective inflections, and perhaps a few derivational suffixes.
The minimal stemmers are even more conservative and may only remove plural endings. The new
Hunspell stemmers are both dictionary and rule based and may provide a tighter stemming than
Snowball for some languages.
  
@@ -522, +480 @@

  <!> [[Solr3.5]] The Hunspell stemmers are configured through the HunspellStemFilterFactory
combined with a dictionary and an affix file. Hunspell supports 99 languages.
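  
  A minimal sketch of such a configuration. The file names {{{en_GB.dic}}} and {{{en_GB.aff}}} are placeholders for whichever Hunspell dictionary and affix files you place in your conf directory:
  
  {{{
  ...
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- dictionary and affix file names are assumptions; substitute your own -->
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true"/>
  ...
  }}}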
  
  ==== Notes about solr.PorterStemFilterFactory ====
- 
  Porter stemmer for the English language.
  
  Standard Lucene implementation of the [[http://tartarus.org/~martin/PorterStemmer/|Porter
Stemming Algorithm]], a normalization process that removes common endings from words.
  
-   Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
+  . Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
  
+ Note: This differs very slightly from the "Porter" algorithm available in `solr.SnowballPorterFilter`,
in that it deviates slightly from the published algorithm. For more details, see the section
"Points of difference from the published algorithm" described [[http://tartarus.org/~martin/PorterStemmer/|here]].
- Note: This differs very slightly from the "Porter" algorithm available in `solr.SnowballPorterFilter`,
in that it deviates slightly from the published algorithm.
- For more details, see the section "Points of difference from the published algorithm" described
[[http://tartarus.org/~martin/PorterStemmer/|here]].
  
  Porter is approximately twice as fast as using SnowballPorterFilterFactory.
  
@@ -540, +496 @@

  KStem is considerably faster than SnowballPorterFilterFactory.
  
  <<Anchor(SnowballPorterFilter)>>
+ 
  ==== Notes about solr.SnowballPorterFilterFactory ====
- 
  Creates `org.apache.lucene.analysis.SnowballPorterFilter`.
  
  Creates a [[http://snowball.tartarus.org/texts/stemmersoverview.html|Snowball stemmer]]
from the Java classes generated from a [[http://snowball.tartarus.org/|Snowball]] specification.
 The language attribute is used to specify the language of the stemmer.
+ 
  {{{
  <fieldtype name="myfieldtype" class="solr.TextField">
    <analyzer>
@@ -553, +510 @@

    </analyzer>
  </fieldtype>
  }}}
- 
  Valid values for the language attribute (creates the snowball stemmer class language + "Stemmer"):
+ 
   * [[http://snowball.tartarus.org/algorithms/armenian/stemmer.html|Armenian]] <!>
[[Lucene3.1]]
   * [[http://snowball.tartarus.org/algorithms/basque/stemmer.html|Basque]] <!> [[Lucene3.1]]
   * [[http://snowball.tartarus.org/algorithms/catalan/stemmer.html|Catalan]] <!> [[Lucene3.1]]
@@ -579, +536 @@

   * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]
  
  <!> Gotchas:
+ 
   * Although the Lovins stemmer is described as faster than Porter/Porter2, practically it
is much slower in Solr, as it is implemented using reflection.
   * Neither the Lovins nor the Finnish stemmer produces correct output (as of Solr 1.4), due
to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]]
   * The Turkish stemmer requires special lowercasing. You should use TurkishLowerCaseFilter
instead of LowerCaseFilter with this language. See [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background
information]].
   * The stemmers are sensitive to diacritics. Think carefully before removing these with
something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results.
For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed
to `resum`, causing it to match with `resumed`, `resuming`, etc. The differences can be more
profound for non-English stemmers.
  
  <<Anchor(CustomizingStemming)>>
+ 
  === Customizing Stemming ===
- 
  Sometimes a stemmer might not do what you want out of the box. For example, you might be happy
with the results on average, but have a few particular cases (such as Product Names or similar)
where it annoys you or actually hurts your search results.
  
  The components below allow you to fine-tune the stemming process by preventing words from
being stemmed at all, or by overriding the stemming algorithm with custom mappings.
@@ -609, +567 @@

    </analyzer>
  </fieldtype>
  }}}
- 
  ==== solr.StemmerOverrideFilterFactory ====
  <!> [[Solr3.1]]
  
@@ -628, +585 @@

    </analyzer>
  </fieldtype>
  }}}
- 
  <<Anchor(Decompounding)>>
+ 
  === Decompounding ===
- 
  Decompounding can improve search results for some languages. At the same time, it can increase
the time it takes to index and search, as well as increase the index size itself.
  
  Solr provides dictionary-based decompounding support via solr.DictionaryCompoundWordTokenFilterFactory.
This factory allows you to provide a dictionary, along with some settings (min/max subword
size, etc), to break compound words into pieces.
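  
  A sketch of how the factory might be configured for German. The dictionary file name {{{germanwords.txt}}} and the size limits are illustrative assumptions, not recommended values:
  
  {{{
  ...
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- dictionary file and size limits are assumptions; supply your own word list -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt" minWordSize="5" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>
  ...
  }}}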
