lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "LanguageAnalysis" by HossMan
Date Wed, 19 May 2010 21:08:01 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageAnalysis" page has been changed by HossMan.
The comment on this change is: normalize header nesting level.
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=1&rev2=2

--------------------------------------------------

  = Language Analysis =
  
- == Overview ==
- 
  This page describes some of the language-specific analysis components available in Solr.
These components can be used to improve search results for specific languages.
  
  Please look at [[AnalyzersTokenizersTokenFilters|AnalyzersTokenizersTokenFilters]] for other
analysis components you can use in combination with these components.
  
  <<TableOfContents>>
  
- === By language ===
+ == By language ==
+ 
- ==== Arabic ====
+ === Arabic ===
  Solr provides support for the [[http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf|Light-10]]
stemming algorithm, and Lucene includes an example stopword list.
  
  This algorithm defines both character normalization and stemming, so these are split into
two filters to provide more flexibility.
@@ -25, +24 @@

  
  Example set of Arabic [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
- ==== Brazilian Portuguese ====
+ === Brazilian Portuguese ===
  Solr includes a modified version of the Snowball Portuguese algorithm for Brazilian Portuguese,
and Lucene includes an example stopword list. This stemmer handles diacritical marks differently
than the European Portuguese stemmer.
  
  {{{
@@ -37, +36 @@

  
  Example set of Brazilian Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java|stopwords]]
(Look for BRAZILIAN_STOP_WORDS)
  
- ==== Bulgarian ====
+ === Bulgarian ===
  <!> [[Solr3.1]]
  
  Solr includes a light stemmer for Bulgarian, following this [[http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf|algorithm]],
and Lucene includes an example stopword list.
@@ -51, +50 @@

  
  Example set of Bulgarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
- ==== Chinese, Japanese, Korean ====
+ === Chinese, Japanese, Korean ===
  Lucene provides support for these languages with CJKTokenizer, which indexes bigrams and
does some character folding of full-width forms.
  
  {{{
@@ -61, +60 @@

  
  <!> Note: Be sure to use PositionFilter at query-time (only) as these languages do
not use spaces between words. 
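  
  A minimal sketch of that advice, assuming a hypothetical field type named "text_cjk" (the
name is illustrative and not taken from the example schema): the index and query analyzers
are declared separately so that PositionFilter runs at query time only.
  
  {{{
  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <!-- index time: emit overlapping bigrams -->
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.CJKTokenizerFactory"/>
      <!-- query time only: place all tokens after the first at the same position,
           so the query parser does not build a phrase query from the bigrams -->
      <filter class="solr.PositionFilterFactory"/>
    </analyzer>
  </fieldType>
  }}}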
  
- ==== Czech ====
+ === Czech ===
  <!> [[Solr3.1]]
  
  Solr includes a light stemmer for Czech, following this [[http://portal.acm.org/citation.cfm?id=1598600|algorithm]],
and Lucene includes an example stopword list.
@@ -75, +74 @@

  
  Example set of Czech [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java|stopwords]]
(Look for CZECH_STOP_WORDS)
  
- ==== Danish ====
+ === Danish ===
  Solr includes support for stemming Danish via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
  
  {{{
@@ -89, +88 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Dutch ====
+ === Dutch ===
  Solr includes two stemmers for Dutch via {{{solr.SnowballPorterFilterFactory}}}, and Lucene
includes an example stopword list.
  
  {{{
@@ -105, +104 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== English ====
+ === English ===
  Solr includes two stemmers for English, the original Porter stemmer via {{{solr.PorterStemFilterFactory}}},
and the Porter2 stemmer via {{{solr.SnowballPorterFilterFactory}}}, as well as an example
stopword list.
  
  {{{
@@ -120, +119 @@

  Larger example set of English 
  [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]
  
- ==== Finnish ====
+ === Finnish ===
  Solr includes support for stemming Finnish via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
  
  {{{
@@ -133, +132 @@

  Example set of Finnish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== French ====
+ === French ===
  Solr includes support for stemming French via {{{solr.SnowballPorterFilterFactory}}}, removing
elisions via ElisionFilterFactory, and Lucene includes an example stopword list.
  
  {{{
@@ -149, +148 @@

  
  <!> Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This
will prevent very slow phrase queries.
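  
  One possible filter ordering that follows this note, as a sketch only. The field type name
"text_fr" and the WordDelimiterFilter options are assumptions chosen for illustration.
  
  {{{
  <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- remove elided articles first, so WordDelimiterFilter never sees the apostrophe -->
      <filter class="solr.ElisionFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    </analyzer>
  </fieldType>
  }}}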
  
- ==== German ====
+ === German ===
  Solr includes support for stemming German with three different algorithms: two via {{{solr.SnowballPorterFilterFactory}}},
and one via {{{solr.GermanStemFilterFactory}}}, and Lucene includes an example stopword list.
  
  With the {{{solr.SnowballPorterFilterFactory}}} you can supply two different language attributes:
"German" and "German2". German2 is just a modified version of German that handles the umlaut
characters differently: for example it treats "ü" as "ue" in most contexts.
@@ -167, +166 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Greek ====
+ === Greek ===
  Solr includes support for stemming Greek following this [[http://people.dsv.su.se/~hercules/papers/Ntais_greek_stemmer_thesis_final.pdf|algorithm]]
<!> [[Solr3.1]], as well as support for case/diacritics-insensitive search via {{{solr.GreekLowerCaseFilterFactory}}},
and Lucene includes an example stopword list.
  
  {{{
@@ -181, +180 @@

  
  <!> Note: Be sure to use the Greek-specific GreekLowerCaseFilterFactory
  
- ==== Hindi ====
+ === Hindi ===
  <!> [[Solr3.1]]
  
  Solr includes support for stemming Hindi following this [[http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf|algorithm]],
support for common spelling differences via {{{solr.HindiNormalizationFilterFactory}}} following
this [[http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.3fe5b38c-02ee-41ce-9a8f-3e745670be32.pdf|algorithm]],
support for encoding differences via {{{solr.IndicNormalizationFilterFactory}}} following
this [[http://ldc.upenn.edu/myl/IndianScriptsUnicode.html|algorithm]], and Lucene includes
an example stopword list.
@@ -196, +195 @@

  
  Example set of Hindi [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
- ==== Hungarian ====
+ === Hungarian ===
  
  Solr includes support for stemming Hungarian via {{{solr.SnowballPorterFilterFactory}}},
and Lucene includes an example stopword list.
  
@@ -211, +210 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Indonesian ====
+ === Indonesian ===
  <!> [[Solr3.1]]
  
  Solr includes support for stemming Indonesian (Bahasa Indonesia) following this [[http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf|algorithm]],
and Lucene includes an example stopword list.
@@ -227, +226 @@

  
  Example set of Indonesian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]
  
- ==== Italian ====
+ === Italian ===
  Solr includes support for stemming Italian via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
  
  {{{
@@ -239, +238 @@

  
  Example set of Italian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
- ==== Norwegian ====
+ === Norwegian ===
  Solr includes support for stemming Norwegian via {{{solr.SnowballPorterFilterFactory}}},
and Lucene includes an example stopword list.
  
  {{{
@@ -253, +252 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Persian / Farsi ====
+ === Persian / Farsi ===
  Solr includes support for normalizing Persian via {{{solr.PersianNormalizationFilterFactory}}},
and Lucene includes an example stopword list.
  
  {{{
@@ -267, +266 @@

  
  <!> Note: WordDelimiterFilter does not split on joiners by default. You can solve
this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory
which supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider
using PositionFilter at query-time (only), as the QueryParser does not consider joiners and
could create unwanted phrase queries.
  
- ==== Portuguese ====
+ === Portuguese ===
  Solr includes support for stemming Portuguese via {{{solr.SnowballPorterFilterFactory}}},
and Lucene includes an example stopword list.
  
  {{{
@@ -279, +278 @@

  
  Example set of Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
- ==== Romanian ====
+ === Romanian ===
  Solr includes support for stemming Romanian via {{{solr.SnowballPorterFilterFactory}}},
and Lucene includes an example stopword list.
  
  {{{
@@ -291, +290 @@

  
  Example set of Romanian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
- ==== Russian ====
+ === Russian ===
  Solr includes support for stemming Russian via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
  
  {{{
@@ -303, +302 @@

  
  Example set of Russian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
- ==== Spanish ====
+ === Spanish ===
  Solr includes support for stemming Spanish via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
  
  {{{
@@ -315, +314 @@

  
  Example set of Spanish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]]
(Be sure to switch your browser encoding to UTF-8)
  
- ==== Swedish ====
+ === Swedish ===
  Solr includes support for stemming Swedish via {{{solr.SnowballPorterFilterFactory}}}, and
Lucene includes an example stopword list.
  
  {{{
@@ -329, +328 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Thai ====
+ === Thai ===
  Solr includes support for breaking Thai text into words via {{{solr.ThaiWordFilterFactory}}}.
  
  {{{
@@ -340, +339 @@

  
  <!> Note: Be sure to use PositionFilter at query-time (only) as this language does
not use spaces between words.
  
- ==== Turkish ====
+ === Turkish ===
  Solr includes support for stemming Turkish via {{{solr.SnowballPorterFilterFactory}}}, as
well as support for case-insensitive search via {{{solr.TurkishLowerCaseFilterFactory}}} <!>
[[Solr3.1]], and Lucene includes an example stopword list.
  
  {{{
@@ -354, +353 @@

  
  <!> Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory <!>
[[Solr3.1]]
  
- === Not yet Integrated ===
+ == Not yet Integrated ==
  
  The following languages have explicit support in Lucene, but it is not yet integrated into
Solr. If you need to support these languages you might find this information useful in the
meantime.
  
- ==== Chinese, Japanese, Korean ====
+ === Chinese, Japanese, Korean ===
  
  Lucene provides support for Chinese word segmentation (SentenceTokenizer, WordTokenFilter)
in a separate jar file (lucene-analyzers-smartcn.jar). This component includes a large dictionary
and segments Chinese text into words using a Hidden Markov Model.
  
@@ -368, +367 @@

  
  <!> Note: Be sure to use PositionFilter at query-time (only) as this language does
not use spaces between words.
  
- ==== Polish ====
+ === Polish ===
  <!> [[Lucene3.1]]
  
  Lucene provides support for Polish stemming (StempelFilter) in a separate jar file (lucene-analyzers-stempel.jar).
This component includes an algorithmic stemmer with tables for Polish.
  
- ==== Lao, Myanmar, Khmer ====
+ === Lao, Myanmar, Khmer ===
  <!> [[Lucene3.1]]
  
  Lucene provides support for segmenting these languages into syllables (ICUTokenizer) in
a separate jar file (lucene-icu.jar).
  
  <!> Note: Be sure to use PositionFilter at query-time (only) as these languages do
not use spaces between words. 
  
- === My language is not listed!!! ===
+ == My language is not listed!!! ==
  
  Your language might work anyway. A first step is to start with the "textgen" type in the
example schema. Remember, things like stemming and stopwords aren't mandatory for the search
to work, only optional language-specific improvements.
  
  If you have problems (for example, your language is highly inflected), you might want to try
using an n-gram approach as an alternative.
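  
  A rough sketch of such an n-gram field type. The name "text_ngram" and the gram sizes are
assumptions for illustration; suitable sizes depend on the language and on how much index growth
is acceptable.
  
  {{{
  <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- split each token into character n-grams of 2 to 4 characters -->
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
    </analyzer>
  </fieldType>
  }}}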
  
+ == Other Tips ==
  === Tokenization ===
  
  In general most languages don't require special tokenization (and will work just fine with
Whitespace + WordDelimiterFilter), so you can safely tailor the English "text" example schema
definition to fit.
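  
  As a starting point, a stripped-down field type along these lines might look like the sketch
below. The name "text_mylang" is hypothetical, and the WordDelimiterFilter options shown are one
reasonable choice, not a copy of the example schema.
  
  {{{
  <fieldType name="text_mylang" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- split on intra-word delimiters and case changes, and also index the joined form -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- language-specific stopword and stemming filters can be added here later -->
    </analyzer>
  </fieldType>
  }}}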
