lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "LanguageAnalysis" by RobertMuir
Date Thu, 03 Mar 2011 03:09:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageAnalysis" page has been changed by RobertMuir.
The comment on this change is: add hy, ca, eu, gl and updates for analysis-extras contrib.
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=10&rev2=11

--------------------------------------------------

  
  Example set of Arabic [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
+ === Armenian ===
+ <!> [[Solr3.1]]
+ 
+ Solr includes support for stemming Armenian via {{{solr.SnowballPorterFilterFactory}}},
and Lucene includes an example stopword list.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.SnowballPorterFilterFactory" language="Armenian" />
+ ...
+ }}}
+ 
+ Example set of Armenian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
+ 
+ === Basque ===
+ <!> [[Solr3.1]]
+ 
+ Solr includes support for stemming Basque via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.SnowballPorterFilterFactory" language="Basque" />
+ ...
+ }}}
+ 
+ Example set of Basque [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
+ 
  === Brazilian Portuguese ===
  Solr includes a modified version of the Snowball Portuguese algorithm for Brazilian Portuguese,
and Lucene includes an example stopword list. This stemmer handles diacritical marks differently
than the European Portuguese stemmer.
  
@@ -34, +62 @@

  ... 
  }}}
  
- Example set of Brazilian Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java|stopwords]]
(Look for BRAZILIAN_STOP_WORDS)
+ Example set of Brazilian Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Bulgarian ===
  <!> [[Solr3.1]]
@@ -49, +77 @@

  }}}
  
  Example set of Bulgarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
+ 
+ === Catalan ===
+ <!> [[Solr3.1]]
+ 
+ Solr includes support for stemming Catalan via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.SnowballPorterFilterFactory" language="Catalan" />
+ ...
+ }}}
+ 
+ Example set of Catalan [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Chinese, Japanese, Korean ===
  Lucene provides support for these languages with CJKTokenizer, which indexes bigrams and
does some character folding of full-width forms.
  
+ <!> [[Solr3.1]] Alternatively, for Simplified Chinese, Solr provides Chinese word
segmentation via {{{solr.SmartChineseWordTokenFilterFactory}}} in the analysis-extras contrib
module. This component includes a large dictionary and segments Chinese text into words using
a Hidden Markov Model. To use this filter, see solr/contrib/analysis-extras/README.txt for
instructions on which jars you need to add to your SOLR_HOME/lib.
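
A minimal sketch of how the word-segmentation filter might be wired into a field type. The field type name {{{text_zh}}} is hypothetical, and pairing the filter with {{{solr.SmartChineseSentenceTokenizerFactory}}} is an assumption based on the SentenceTokenizer/WordTokenFilter pairing in the underlying Lucene module; check the analysis-extras README before relying on it.

```xml
<!-- Hypothetical field type; the name "text_zh" is illustrative only. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumed companion tokenizer: emits one token per sentence. -->
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <!-- Splits each sentence token into dictionary/HMM-segmented words. -->
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

This chain is an alternative to the CJKTokenizer bigram approach, not an addition to it; a field would normally use one or the other.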
+ 
  {{{
     <tokenizer class="solr.CJKTokenizerFactory"/>
  ...
@@ -72, +116 @@

  ...
  }}}
  
- Example set of Czech [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java|stopwords]]
(Look for CZECH_STOP_WORDS)
+ Example set of Czech [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/cz/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  === Danish ===
  Solr includes support for stemming Danish via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
@@ -151, +195 @@

  
  <!> Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This
will prevent very slow phrase queries.
  
+ === Galician ===
+ <!> [[Solr3.1]]
+ 
+ Solr includes a stemmer for Galician following this [[http://bvg.udc.es/recursos_lingua/stemming.jsp|algorithm]],
and Lucene includes an example stopword list.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.GalicianStemFilterFactory"/>
+ ...
+ }}}
+ 
+ Example set of Galician [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/gl/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
+ 
  === German ===
  Solr includes support for stemming German with five different algorithms: two via {{{solr.SnowballPorterFilterFactory}}},
one via {{{solr.GermanStemFilterFactory}}}, a lightweight stemmer <!> [[Solr3.1]] via
{{{solr.GermanLightStemFilterFactory}}}, and an even less aggressive approach <!> [[Solr3.1]]
via {{{solr.GermanMinimalStemFilterFactory}}}. Lucene includes an example stopword list.
  
@@ -241, +299 @@

  
  Example set of Italian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
+ === Lao, Myanmar, Khmer ===
+ <!> [[Solr3.1]]
+ 
+ Lucene provides support for segmenting these languages into syllables with {{{solr.ICUTokenizerFactory}}}
in the analysis-extras contrib module. To use this tokenizer, see solr/contrib/analysis-extras/README.txt
for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ <!> Note: Be sure to use PositionFilter at query-time (only) as these languages do
not use spaces between words. 
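
As a rough illustration of the note above, the following sketch uses separate index-time and query-time analyzers so that PositionFilter is applied to queries only. The field type name {{{text_icu}}} is hypothetical, and the chain is deliberately minimal; a real schema would likely add further filters.

```xml
<!-- Hypothetical field type; the name "text_icu" is illustrative only. -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Segments Lao, Myanmar, and Khmer text into syllables. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Query-time only: flattens token positions so the query parser
         does not turn an unspaced run of syllables into a phrase query. -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>
```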
+ 
  === Norwegian ===
  Solr includes support for stemming Norwegian via {{{solr.SnowballPorterFilterFactory}}},
and Lucene includes an example stopword list.
  
@@ -265, +330 @@

  ...
  }}}
  
- Example set of Persian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]]
+ Example set of Persian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: WordDelimiterFilter does not split on joiners by default. You can solve
this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory
which supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider
using PositionFilter at query-time (only), as the QueryParser does not consider joiners and
could create unwanted phrase queries.
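
The first option in the note above could be sketched as follows. The field type name {{{text_fa}}} is hypothetical, and the Persian normalization and stopword filters are omitted for brevity, so this is only an outline of where the tokenizer and PositionFilter would sit.

```xml
<!-- Hypothetical field type; the name "text_fa" is illustrative only. -->
<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Tokenizer named in the note above; it splits on joiners. -->
    <tokenizer class="solr.ArabicLetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ArabicLetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query-time only: avoids unwanted phrase queries when the
         query parser sees several tokens from one joined term. -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>
```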
  
+ === Polish ===
+ <!> [[Solr3.1]]
+ 
+ Lucene provides support for Polish stemming via {{{solr.StempelPolishStemFilterFactory}}} in
the analysis-extras contrib module. This component includes an algorithmic stemmer with tables
for Polish.
+ To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which
jars you need to add to your SOLR_HOME/lib.
+ 
+ {{{
+ ...
+   <filter class="solr.LowerCaseFilterFactory"/>
+   <filter class="solr.StempelPolishStemFilterFactory"/>
+ ...
+ }}}
+ 
+ Example set of Polish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/stempel/src/resources/org/apache/lucene/analysis/pl/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
+ 
  === Portuguese ===
- Solr includes three stemmers for Portuguese: one via {{{solr.SnowballPorterFilterFactory}}},
an alternative stemmer <!> [[Solr3.1]] via {{{solr.PortugueseLightStemFilterFactory}}},
and an even less aggressive approach <!> [[Solr3.1]] via {{{solr.PortugueseMinimalStemFilterFactory}}}.
Lucene includes an example stopword list.
+ Solr includes four stemmers for Portuguese: one via {{{solr.SnowballPorterFilterFactory}}},
an alternative stemmer <!> [[Solr3.1]] via {{{solr.PortugueseStemFilterFactory}}}, a
lighter stemmer <!> [[Solr3.1]] via {{{solr.PortugueseLightStemFilterFactory}}}, and
an even less aggressive approach <!> [[Solr3.1]] via {{{solr.PortugueseMinimalStemFilterFactory}}}.
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -355, +435 @@

  Example set of Turkish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
  <!> Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory <!>
[[Solr3.1]]
- 
- == Not yet Integrated ==
- 
- The following languages have explicit support in Lucene, but it is not yet integrated into
Solr. If you need to support these languages you might find this information useful in the
meantime.
- 
- === Chinese, Japanese, Korean ===
- 
- Lucene provides support for Chinese word segmentation (SentenceTokenizer, WordTokenFilter)
in a separate jar file (lucene-analyzers-smartcn.jar). This component includes a large dictionary
and segments Chinese text into words with the Hidden Markov Model.
- 
- <!> [[Lucene3.1]]
- 
- Additionally, Lucene provides support for matching between Traditional and Simplified Chinese
and for matching between Hiragana and Katakana (ICUTransformFilter) in a separate jar file
(lucene-icu.jar).
- 
- <!> Note: Be sure to use PositionFilter at query-time (only) as this language does
not use spaces between words.
- 
- === Polish ===
- <!> [[Lucene3.1]]
- 
- Lucene provides support for Polish stemming (StempelFilter) in a separate jar file (lucene-analyzers-stempel.jar).
This component includes an algorithmic stemmer with tables for Polish.
- 
- === Lao, Myanmar, Khmer ===
- <!> [[Lucene3.1]]
- 
- Lucene provides support for segmenting these languages into syllables (ICUTokenizer) in
a separate jar file (lucene-icu.jar).
- 
- <!> Note: Be sure to use PositionFilter at query-time (only) as these languages do
not use spaces between words. 
  
  == My language is not listed!!! ==
  
@@ -464, +518 @@

  }}}
  
  Valid values for the language attribute (the Snowball stemmer class is resolved as language + "Stemmer", e.g. "Danish" loads DanishStemmer):
+  * [[http://snowball.tartarus.org/algorithms/armenian/stemmer.html|Armenian]] <!>
[[Lucene3.1]]
+  * [[http://snowball.tartarus.org/algorithms/basque/stemmer.html|Basque]] <!> [[Lucene3.1]]
+  * [[http://snowball.tartarus.org/algorithms/catalan/stemmer.html|Catalan]] <!> [[Lucene3.1]]
   * [[http://snowball.tartarus.org/algorithms/danish/stemmer.html|Danish]]
   * [[http://snowball.tartarus.org/algorithms/dutch/stemmer.html|Dutch]]
   * [[http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html|Kp]]: The Kraaij-Pohlmann
stemming algorithm for Dutch.
