lucene-commits mailing list archives

From "Hoss Man (Confluence)" <conflue...@apache.org>
Subject [CONF] Apache Solr Reference Guide > Language Analysis
Date Thu, 11 Jul 2013 22:34:00 GMT
Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: Language Analysis (https://cwiki.apache.org/confluence/display/solr/Language+Analysis)

Change Comment:
---------------------------------------------------------------------
SOLR-5031: initial step of completely removing problematic examples

Edited by Hoss Man:
---------------------------------------------------------------------
{section}
{column:width=60%}

This section contains information about tokenizers and filters related to character set conversion
or for use with specific languages. For the European languages, tokenization is fairly straightforward.
Tokens are delimited by white space and/or a relatively small set of punctuation characters.
In other languages the tokenization rules are often not so simple. Some European languages
may require special tokenization rules as well, such as rules for decompounding German words.

For information about language detection at index time, see [Detecting Languages During Indexing].
{column}

{column:width=40%}
{panel}
Topics discussed in this section:
{toc:maxLevel=2}
{panel}
{column}
{section}

h2. KeywordMarkerFilterFactory

Protects words from being modified by stemmers. A customized protected word list may be specified
with the "protected" attribute in the schema. Any words in the protected word list will not
be modified by any stemmer in Solr.

A sample Solr {{protwords.txt}} with comments can be found in the {{/solr/conf/}} directory. The following field type uses it:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldtype>
{code}
{topofpage}

h2. StemmerOverrideFilterFactory

Overrides stemming algorithms by applying a custom mapping, then protecting these terms from
being modified by stemmers.

A customized mapping of words to stems, in a tab-separated file, can be specified with the "dictionary"
attribute in the schema. Words in this mapping will be stemmed to the stems from the file,
and will not be further changed by any stemmer.

A sample [stemdict.txt|http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/solr/core/src/test-files/solr/collection1/conf/stemdict.txt]
with comments can be found in the Source Repository.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldtype>
{code}
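
The interplay of overrides, protected words, and a stemmer can be sketched in plain Java. This is a minimal illustration and not Solr's implementation: the class name, the {{toyStem}} method, and the sample words are hypothetical, and the rule that lines starting with "#" are comments is an assumption about the dictionary file format.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Illustrative sketch only; not the Solr/Lucene implementation. */
class StemOverrideSketch {

    /** Parses "word<TAB>stem" lines; blank lines and '#' lines are skipped (assumed format). */
    static Map<String, String> parseDictionary(List<String> lines) {
        Map<String, String> overrides = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                overrides.put(parts[0], parts[1]);
            }
        }
        return overrides;
    }

    /** Hypothetical toy stemmer: strips one trailing "s". Stands in for a real stemmer. */
    static String toyStem(String token) {
        return token.length() > 1 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    /** Applies an override if present, leaves protected words alone, otherwise stems. */
    static String analyze(String token, Map<String, String> overrides,
                          Set<String> protectedWords, UnaryOperator<String> stemmer) {
        String override = overrides.get(token);
        if (override != null) {
            return override;              // mapped stem, not stemmed further
        }
        if (protectedWords.contains(token)) {
            return token;                 // protected word, left unmodified
        }
        return stemmer.apply(token);
    }

    public static void main(String[] args) {
        Map<String, String> overrides =
                parseDictionary(Arrays.asList("# sample mapping", "geese\tgoose"));
        Set<String> protectedWords = new HashSet<>(Arrays.asList("paris"));
        for (String token : Arrays.asList("geese", "paris", "cats")) {
            System.out.println(token + " -> "
                    + analyze(token, overrides, protectedWords, StemOverrideSketch::toyStem));
        }
    }
}
```

In a real schema the same ordering matters: the override and keyword-marker filters must come before the stemming filter in the analyzer chain.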
{topofpage}
h2. Dictionary Compound Word Token Filter

This filter splits, or _decompounds_, compound words into individual words using a dictionary
of the component words. Each input token is passed through unchanged. If it can also be decompounded
into subwords, each subword is also added to the stream at the same logical position.

Compound words are most commonly found in Germanic languages.

*Factory class:* solr.DictionaryCompoundWordTokenFilterFactory

*Arguments:*

{{dictionary}}: (required) The path of a file that contains a list of simple words, one per
line.  Blank lines and lines that begin with "#" are ignored.  This path may be an absolute
path, or path relative to the Solr config directory.

{{minWordSize}}: (integer, default 5) Any token shorter than this is not decompounded.

{{minSubwordSize}}: (integer, default 2) Subwords shorter than this are not emitted as tokens.

{{maxSubwordSize}}: (integer, default 15) Subwords longer than this are not emitted as tokens.

{{onlyLongestMatch}}: (true/false)  If true (the default), only the longest matching subwords
will generate new tokens.
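
How these arguments interact can be sketched with a simplified decompounder. This is an illustration under stated assumptions, not the Lucene code: the class and method names are hypothetical, the dictionary is assumed to hold lowercase entries, and the sketch returns only the subwords (the real filter also passes the original token through).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Simplified dictionary decompounder; illustrative only. */
class DecompoundSketch {

    /** Returns the subwords of token that appear in the (lowercase) dictionary. */
    static List<String> decompound(String token, Set<String> dictionary,
                                   int minWordSize, int minSubwordSize,
                                   int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> subwords = new ArrayList<>();
        if (token.length() < minWordSize) {
            return subwords; // too short: not decompounded
        }
        for (int start = 0; start + minSubwordSize <= token.length(); start++) {
            String longest = null;
            for (int len = minSubwordSize; len <= maxSubwordSize; len++) {
                int end = start + len;
                if (end > token.length()) {
                    break;
                }
                String candidate = token.substring(start, end); // keeps original case
                if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    if (onlyLongestMatch) {
                        longest = candidate; // later matches at this start are longer
                    } else {
                        subwords.add(candidate);
                    }
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(
                Arrays.asList("dumm", "kopf", "donau", "dampf", "schiff"));
        System.out.println(decompound("Donaudampfschiff", dict, 5, 2, 15, true));
        System.out.println(decompound("dummkopf", dict, 5, 2, 15, true));
    }
}
```

With the five-word dictionary shown in {{main}}, the two calls print {{[Donau, dampf, schiff]}} and {{[dumm, kopf]}}.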

*Example:*

Assume that {{germanwords.txt}} contains at least the following words:

{{dumm}}, {{kopf}}, {{donau}}, {{dampf}}, {{schiff}}

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
{code}

*In:* "Donaudampfschiff dummkopf"

*Tokenizer to Filter:* "Donaudampfschiff"(1), "dummkopf"(2)

*Out:* "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2),
"kopf"(2)
{topofpage}
h2. Unicode Collation

Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced
search purposes.

Unicode Collation in Solr is fast, because all the work is done at index time. It uses a {{KeywordTokenizerFactory}}
to create a sort field, followed by {{CollationKeyFilterFactory}}. The {{CollationKeyFilterFactory}}
adds "sort keys" to the {{sort}} field at index time, so that at query time you can sort on
the {{sort}} field and your results come back in collated order.

You can also use the {{CollationField}} and {{ICUCollationField}} field types to hold the results of your collation.

h3. Sorting Text for a Specific Language

In this example, text is sorted according to the default German rules provided by Java. The
rules for sorting German in Java are defined in a package called a Java Locale.

Locales are typically defined as a combination of language and country, but you can specify
just the language if you want. For example, if you specify "de" as the language, you will
get sorting that works well for the German language. If you specify "de" as the language and "CH"
as the country, you will get German sorting specifically tailored for Switzerland.

You can see a list of supported Locales [here|http://java.sun.com/j2se/1.5.0/docs/guide/intl/locale.doc.html#util-text].

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<!-- define a field type for German collation -->
<fieldType name="collatedGERMAN" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        language="de"
        strength="primary"
    />
  </analyzer>
</fieldType>
...
<!-- define a field to store the German collated manufacturer names -->
<field name="manuGERMAN" type="collatedGERMAN" indexed="true" stored="false" />
...
<!-- copy the text to this field. we could create French, English, Spanish versions too,
     and sort differently for different users! -->
<copyField source="manu" dest="manuGERMAN"/>
{code}

In the example above, we defined the strength as "primary". The strength of the collation
determines how strict the sort order will be, but it also depends upon the language. For example,
in English, "primary" strength ignores differences in case and accents.

For more information, see the [Collator javadocs|http://java.sun.com/j2se/1.5.0/docs/api/java/text/Collator.html].
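
The effect of the strength setting can be explored directly with {{java.text.Collator}}. The following is a small sketch, assuming {{Locale.GERMAN}} as the equivalent of {{language="de"}}; the class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

/** Illustrative sketch of collation strength with the JDK's Collator. */
class CollationStrengthSketch {

    /** Compares two strings with a German collator at the given strength. */
    static int compareGerman(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // Case is a tertiary difference, so PRIMARY strength treats these as equal.
        System.out.println(compareGerman("strasse", "STRASSE", Collator.PRIMARY));
        // At TERTIARY strength the case difference becomes significant.
        System.out.println(compareGerman("strasse", "STRASSE", Collator.TERTIARY));
        // Accent handling depends on the locale rules; inspect the result for your data.
        System.out.println(compareGerman("Tone", "T\u00f6ne", Collator.PRIMARY));
    }
}
```

A result of {{0}} means the collator considers the two strings equal at that strength, which is what makes them sort together and match each other.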

h3. Sorting Text for Multiple Languages

There are two approaches to supporting multiple languages: if there is a small list of languages
you wish to support, consider defining collated fields for each language and using {{copyField}}.
However, adding a large number of sort fields can increase disk and indexing costs. An alternative
approach is to use the Unicode {{default}} collator.

The Unicode {{default}} or {{ROOT}} locale has rules that are designed to work well for most
languages. To use the {{default}} locale, simply define the language as the empty string.
This Unicode default sort is still significantly more advanced than the standard Solr sort.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="collatedROOT" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        language=""
        strength="primary"
    />
  </analyzer>
</fieldType>
{code}
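
To see why a collated sort differs from a plain string sort, the sketch below compares the two in Java. It assumes that {{Locale.ROOT}} corresponds to the empty-string language above, and it uses a UTF-16 code unit sort as a stand-in for a non-collated sort; the class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Illustrative comparison of a code unit sort and a ROOT-locale collated sort. */
class RootCollationSketch {

    /** Sorts with String.compareTo, which compares UTF-16 code units. */
    static List<String> sortByCodeUnit(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }

    /** Sorts with the collator for the ROOT locale at its default strength. */
    static List<String> sortByRootCollator(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        Collections.sort(copy, Collator.getInstance(Locale.ROOT));
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("cherry", "Banana", "apple");
        System.out.println(sortByCodeUnit(names));     // uppercase sorts before lowercase
        System.out.println(sortByRootCollator(names)); // alphabetical regardless of case
    }
}
```

The first call prints {{[Banana, apple, cherry]}} and the second prints {{[apple, Banana, cherry]}}.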

h3. Sorting Text with Custom Rules

You can define your own set of sorting rules. It's easiest to take existing rules that are
close to what you want and customize them.

In the example below, we create a custom rule set for German called DIN 5007-2. This rule
set treats umlauts in German differently: it treats ö as equivalent to oe. For more information,
see the [RuleBasedCollator javadocs|http://java.sun.com/j2se/1.5.0/docs/api/java/text/RuleBasedCollator.html].

This example shows how to create a custom rule set and dump it to a file:

{code:language=java|borderStyle=solid|borderColor=#666666}
import java.io.FileOutputStream;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;
// IOUtils is assumed to be the Apache Commons IO class
import org.apache.commons.io.IOUtils;

// get the default rules for Germany
// these are called DIN 5007-1 sorting
RuleBasedCollator baseCollator = (RuleBasedCollator) Collator.getInstance(new Locale("de", "DE"));

// define some tailorings, to make it DIN 5007-2 sorting.
// For example, this makes ö equivalent to oe
String DIN5007_2_tailorings =
"& ae , a\u0308 & AE , A\u0308"+
"& oe , o\u0308 & OE , O\u0308"+
"& ue , u\u0308 & UE , U\u0308";

// concatenate the default rules to the tailorings, and dump it to a String
RuleBasedCollator tailoredCollator = new RuleBasedCollator(baseCollator.getRules() + DIN5007_2_tailorings);
String tailoredRules = tailoredCollator.getRules();
// write these to a file, be sure to use UTF-8 encoding!!!
IOUtils.write(tailoredRules, new FileOutputStream("/solr_home/conf/customRules.dat"), "UTF-8");
{code}

This rule set can now be used for custom collation in Solr:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="collatedCUSTOM" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
  </analyzer>
</fieldType>
{code}

h3. Searching

Collation can also be used to search on a tokenized field.

In this example, we use the same custom German rules defined above on a tokenized field. As
with stemmers, although the output tokens are nonsense, they are the same values and will match
for search purposes.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="collatedCUSTOM" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
  </analyzer>
</fieldType>
{code}
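
Why collated tokens match can be sketched with {{java.text.CollationKey}}. This example uses the stock German collator instead of the custom rules file, and the class and method names are hypothetical; it only illustrates that strings equal at primary strength produce identical key bytes, not how the filter encodes them.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

/** Illustrative sketch: primary-strength collation keys as match values. */
class CollationKeySketch {

    /** Returns the raw collation key bytes for text at PRIMARY strength. */
    static byte[] primaryKey(String text) {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        collator.setStrength(Collator.PRIMARY);
        return collator.getCollationKey(text).toByteArray();
    }

    public static void main(String[] args) {
        byte[] indexed = primaryKey("Donau");
        byte[] queried = primaryKey("DONAU");
        // The bytes are not readable text, but equal inputs yield equal keys.
        System.out.println(Arrays.toString(indexed));
        System.out.println(Arrays.equals(indexed, queried));
    }
}
```

Because the same analysis is applied at index time and at query time, a query term and an indexed term that differ only in ways the strength ignores end up as the same value.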

h3. Collation Key Filter

The filter {{solr.CollationKeyFilterFactory}} is used at index time, indexing special "sort keys"
into the sort field. It lets you choose the collator related to the target country and language.
You can also choose the strength of the collation, which determines the minimum level of difference
considered significant during comparison. For example:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<filter class="solr.CollationKeyFilterFactory" language="es" country="ES" strength="primary"
/> 
{code}

The example above shows the configuration of the {{CollationKeyFilterFactory}}, where we want
to handle the Spanish language with primary strength.

You can add the filter into field type definitions, as in the example below:

{code:xml|borderStyle=solid|borderColor=#666666}
<fieldType name="polishLowercase" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.TrimFilterFactory"/>
     <filter class="solr.CollationKeyFilterFactory" language="pl" country="PL" strength="primary"/>
   </analyzer>
</fieldType>
{code}

This adds Polish collation to the definition of the existing {{lowercase}}
type. Use the new type for fields whose data contains Polish characters. For example,
you could change the type of the {{city_sort}} field to {{polishLowercase}}:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<field name="city_sort" type="polishLowercase" indexed="true" stored="false" />
{code}

You can check the result with a test query:

{code:language=none|borderStyle=solid|borderColor=#666666}
q=*:*&fl=city&sort=city_sort+asc
{code}

And the result may look like this:

{code:xml|borderStyle=solid|borderColor=#666666}
<result name="response" numFound="6" start="0">
   <doc>
      <str name="city">Białystok</str>
   </doc>
   <doc>
      <str name="city">Koszalin</str>
   </doc>
   <doc>
      <str name="city">Łowicz</str>
   </doc>
   <doc>
      <str name="city">Szczecin</str>
   </doc>
   <doc>
      <str name="city">Świdnik</str>
   </doc>
   <doc>
      <str name="city">Warszawa</str>
   </doc>
</result>
{code}

h3. ICU Collation

For better performance, less memory usage, and support for more locales, you can add the {{analysis-extras}}
contrib and use {{ICUCollationKeyFilterFactory}} instead. See the [javadocs|http://lucene.apache.org/solr/4_0_0/solr-analysis-extras/org/apache/solr/schema/ICUCollationField.html]
for more information.

The principles of ICU Collation are the same as those of Unicode Collation; you just specify
an RFC3066 language identifier with the locale parameter instead of specifying {{language+country+variant}}.

For example, to get German phonebook sort order:

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="collatedICU" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUCollationKeyFilterFactory"
        locale="de@collation=phonebook"
        strength="primary"
    />
  </analyzer>
</fieldType>
{code}

To use the {{ICUCollationKeyFilterFactory}} filter, see {{solr/contrib/analysis-extras/README.txt}}
for instructions on which jars you need to add to your {{SOLR_HOME/lib}}.
{topofpage}
h2. ISO Latin Accent Filter

This filter replaces any accented characters in a token with the unaccented equivalent. This
can increase recall by causing more matches. On the other hand, it can reduce precision because
language-specific character differences may be lost.

Characters in the ISO Latin 1 (ISO-8859-1) character set are recognized and letter case will
be preserved, so that "Â" becomes "A" and "á" becomes "a".

{note}
This filter only looks for accented characters; it does not filter out other non-ASCII characters.
{note}

*Factory class:* solr.ISOLatin1AccentFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ISOLatin1AccentFilterFactory"/>
</analyzer>
{code}

*In:* "Björn Ångström"

*Tokenizer to Filter:* "Björn", "Ångström"

*Out:* "Bjorn", "Angstrom"
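
A comparable folding can be reproduced in plain Java for experimentation. The sketch below is a stand-in technique, not the filter's implementation: it decomposes characters with {{java.text.Normalizer}} and drops the combining marks, so characters without a canonical decomposition are left unchanged. The class and method names are hypothetical.

```java
import java.text.Normalizer;

/** Illustrative accent stripping via Unicode decomposition. */
class AccentStripSketch {

    /** Decomposes to NFD and removes combining marks, preserving letter case. */
    static String stripAccents(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Bj\u00f6rn \u00c5ngstr\u00f6m" is the sample input written with Unicode escapes.
        System.out.println(stripAccents("Bj\u00f6rn \u00c5ngstr\u00f6m"));
    }
}
```

For the sample input this prints {{Bjorn Angstrom}}.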
{topofpage}
h2. Language-Specific Factories

These factories are each designed to work with specific languages. The languages covered here
are:

{section}
{column:width=25%}
* [Arabic|#Arabic]
* [Brazilian Portuguese|#Brazilian Portuguese]
* [Bulgarian|#Bulgarian]
* [Chinese|#Chinese]
* [Simplified Chinese|#Simplified Chinese]
* [CJK|#CJK]
* [Czech|#Czech]
{column}

{column:width=25%}
* [Dutch|#Dutch]
* [Finnish|#Finnish]
* [French|#French]
* [Galician|#Galician]
* [German|#German]
* [Greek|#Greek]
* [Hindi|#Hindi]
{column}

{column:width=25%}
* [Indonesian|#Indonesian]
* [Italian|#Italian]
* [Kuromoji (Japanese)|#Kuromoji (Japanese)]
* [Lao, Myanmar, Khmer|#Lao, Myanmar, Khmer]
* [Latvian|#Latvian]
* [Norwegian|#Norwegian]
* [Persian|#Persian]
{column}

{column:width=25%}
* [Polish|#Polish]
* [Portuguese|#Portuguese]
* [Russian|#Russian]
* [Spanish|#Spanish]
* [Swedish|#Swedish]
* [Thai|#Thai]
* [Turkish|#Turkish]
{column}
{section}

h3. Arabic

Solr provides support for the [Light-10|http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf]
(PDF) stemming algorithm, and Lucene includes an example stopword list.

This algorithm defines both character normalization and stemming, so these are split into
two filters to provide more flexibility.

*Factory classes:* solr.ArabicStemFilterFactory, solr.ArabicNormalizationFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ArabicNormalizationFilterFactory"/>
  <filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
{code}

{topofpage}
h3. Brazilian Portuguese

This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese
language. It uses the Lucene class {{org.apache.lucene.analysis.br.BrazilianStemmer}}. Although
that stemmer can be configured to use a list of protected words (which should not be stemmed),
this factory does not accept any arguments to specify such a list.

*Factory class:* solr.BrazilianStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.BrazilianStemFilterFactory"/>
</analyzer>
{code}

*In:* "praia praias"

*Tokenizer to Filter:* "praia", "praias"

*Out:* "pra", "pra"
{topofpage}
h3. Bulgarian

Solr includes a light stemmer for Bulgarian, following [this algorithm|http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf]
(PDF), and Lucene includes an example stopword list.

*Factory class:* solr.BulgarianStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.BulgarianStemFilterFactory"/>
</analyzer>
{code}

{topofpage}
h3. Chinese

h4. Chinese Tokenizer

The Chinese Tokenizer is deprecated as of Solr 3.4. Use the [{{solr.StandardTokenizerFactory}}|Tokenizers#Standard Tokenizer] instead.

*Factory class:* solr.ChineseTokenizerFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.ChineseTokenizerFactory"/>
</analyzer>
{code}

h4. Chinese Filter Factory

The Chinese Filter Factory is deprecated as of Solr 3.4. Use the [{{solr.StopFilterFactory}}|Filter Descriptions#Stop Filter] instead.

*Factory class:* solr.ChineseFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ChineseFilterFactory"/>
</analyzer>
{code}

{topofpage}
h3. Simplified Chinese

For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with
the {{solr.SmartChineseSentenceTokenizerFactory}} and {{solr.SmartChineseWordTokenFilterFactory}}
in the {{analysis-extras}} contrib module. This component includes a large dictionary and
segments Chinese text into words using a Hidden Markov Model. To use these components, see {{solr/contrib/analysis-extras/README.txt}}
for instructions on which jars you need to add to your {{solr_home/lib}}.

*Factory class:* solr.SmartChineseWordTokenFilterFactory

*Arguments:* None

*Examples:*

To use the default setup, which falls back to the English Porter stemmer for English words, use:

{{<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>}}

Or, to configure your own analysis setup, use the {{SmartChineseSentenceTokenizerFactory}}
along with your custom filter setup. The sentence tokenizer tokenizes on sentence boundaries
and the {{SmartChineseWordTokenFilter}} breaks the sentences up further into words.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
 <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
 <filter class="solr.SmartChineseWordTokenFilterFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.PositionFilterFactory" />
</analyzer>
{code}
{topofpage}
h3. CJK

This tokenizer breaks Chinese, Japanese and Korean language text into tokens. These are not
whitespace delimited languages. The tokens generated by this tokenizer are "doubles", overlapping
pairs of CJK characters found in the field text.

*Factory class:* solr.CJKTokenizerFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.CJKTokenizerFactory"/>
</analyzer>
{code}
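
The idea of overlapping pairs can be sketched in a few lines of Java. This is an illustration only, with hypothetical class and method names: it handles a single run of CJK characters and does not reproduce how the tokenizer treats mixed text, punctuation, or a lone character.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative overlapping-pair (bigram) generation for a run of CJK characters. */
class CjkBigramSketch {

    /** Returns every pair of adjacent code points in text, in order. */
    static List<String> bigrams(String text) {
        int[] codePoints = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < codePoints.length; i++) {
            tokens.add(new String(codePoints, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u65e5\u672c\u8a9e" is a three-character sample written with Unicode escapes.
        System.out.println(bigrams("\u65e5\u672c\u8a9e"));
    }
}
```

A run of three characters therefore yields two tokens, each sharing one character with its neighbor.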

{topofpage}
h3. Czech

Solr includes a light stemmer for Czech, following [this algorithm|https://dl.acm.org/citation.cfm?id=1598600],
and Lucene includes an example stopword list.

*Factory class:* solr.CzechStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.CzechStemFilterFactory"/>
</analyzer>
{code}

*In:* "prezidenští, prezidenta, prezidentského"

*Tokenizer to Filter:* "prezidenští", "prezidenta", "prezidentského"

*Out:* "preziden", "preziden", "preziden"
{topofpage}
h3. Dutch

This is a Java filter written specifically for stemming the Dutch language. It uses the Lucene
class {{org.apache.lucene.analysis.nl.DutchStemmer}}. Although that stemmer can be configured
to use a list of protected words (which should not be stemmed), this factory does not accept
any arguments to specify such a list.

Another option for stemming Dutch words is to use the Snowball Porter Stemmer with an argument
of {{language="Dutch"}}.

*Factory class:* solr.DutchStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DutchStemFilterFactory"/>
</analyzer>
{code}

*In:* "kanaal kanalen"

*Tokenizer to Filter:* "kanaal", "kanalen"

*Out:* "kanal", "kanal"
{topofpage}
h3. Finnish

Solr includes support for stemming Finnish, and Lucene includes an example stopword list.

*Factory class:* solr.FinnishLightStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.FinnishLightStemFilterFactory"/>
</analyzer>
{code}

*In:* "kala kalat"

*Tokenizer to Filter:* "kala", "kalat"

*Out:* "kala", "kala"
{topofpage}
h3. French

h4. Elision Filter

Removes article elisions from a token stream. This filter primarily applies to the French
language and makes use of the ElisionFilter class in {{org.apache.lucene.analysis.fr}}.

*Factory class:* solr.ElisionFilterFactory

*Arguments:*

{{articles}}: (required) The pathname of a file that contains a list of articles, one per
line, to be stripped. Articles are words such as "le" that are commonly elided before a vowel,
as in _l'avion_ (the plane). The file should contain the elided form that precedes the
apostrophe; in this case, simply "_l_".

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ElisionFilterFactory"/>
</analyzer>
{code}

*In:* "L'histoire d'art"

*Tokenizer to Filter:* "L'histoire", "d'art"

*Out:* "histoire", "art"
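
The stripping step can be sketched in plain Java. This is a simplified illustration with hypothetical class and method names, not the Lucene {{ElisionFilter}} itself; matching the article case-insensitively is an assumption made so that "L'histoire" behaves as in the example above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative elision stripping for tokens such as "l'avion". */
class ElisionSketch {

    /** Removes a leading elided article and its apostrophe if the article is listed. */
    static String stripElision(String token, Set<String> articles) {
        int apostrophe = token.indexOf('\'');
        if (apostrophe > 0) {
            String prefix = token.substring(0, apostrophe).toLowerCase(Locale.ROOT);
            if (articles.contains(prefix)) {
                return token.substring(apostrophe + 1);
            }
        }
        return token; // no listed article in front of an apostrophe
    }

    public static void main(String[] args) {
        Set<String> articles = new HashSet<>(Arrays.asList("l", "d"));
        System.out.println(stripElision("L'histoire", articles));
        System.out.println(stripElision("d'art", articles));
        System.out.println(stripElision("aujourd'hui", articles));
    }
}
```

The third call shows that a token is left unchanged when the text before the apostrophe is not one of the listed articles.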

h4. French Light Stem Filter

Solr includes three stemmers for French: one in the {{solr.SnowballPorterFilterFactory}},
a lighter stemmer called {{solr.FrenchLightStemFilterFactory}}, and an even less aggressive
stemmer called {{solr.FrenchMinimalStemFilterFactory}}. Lucene includes an example stopword
list.

*Factory classes:* solr.FrenchLightStemFilterFactory, solr.FrenchMinimalStemFilterFactory

*Arguments:* None

*Examples:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory"/>
  <filter class="solr.FrenchLightStemFilterFactory"/>
</analyzer>
{code}

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory"/>
  <filter class="solr.FrenchMinimalStemFilterFactory"/>
</analyzer>
{code}

*In:* "le chat, les chats"

*Tokenizer to Filter:* "le", "chat", "les", "chats"

*Out:* "le", "chat", "le", "chat"
{topofpage}
h3. Galician

Solr includes a stemmer for Galician following [this algorithm|http://bvg.udc.es/recursos_lingua/stemming.jsp],
and Lucene includes an example stopword list.

*Factory class:* solr.GalicianStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.GalicianStemFilterFactory"/>
</analyzer>
{code}

*In:* "felizmente Luzes"

*Tokenizer to Filter:* "felizmente", "luzes"

*Out:* "feliz", "luz"
{topofpage}
h3. German

Solr includes four stemmers for German: one in the {{solr.SnowballPorterFilterFactory language="German"}},
a stemmer called {{solr.GermanStemFilterFactory}}, a lighter stemmer called {{solr.GermanLightStemFilterFactory}},
and an even less aggressive stemmer called {{solr.GermanMinimalStemFilterFactory}}. Lucene
includes an example stopword list.

*Factory classes:* solr.GermanStemFilterFactory, solr.GermanLightStemFilterFactory, solr.GermanMinimalStemFilterFactory

*Arguments:* None

*Examples:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.GermanStemFilterFactory"/>
</analyzer>
{code}

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
{code}

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.GermanMinimalStemFilterFactory"/>
</analyzer>
{code}

*In:* "hund hunden"

*Tokenizer to Filter:* "hund", "hunden"

*Out:* "hund", "hund"
{topofpage}
h3. Greek

This filter converts uppercase letters in the Greek character set to the equivalent lowercase
character.

*Factory class:* solr.GreekLowerCaseFilterFactory

*Arguments:*

{{charset}}: (optional, default "UnicodeGreek") Specifies the name of the character set to
use. Must be "UnicodeGreek", "ISO" or "CP1253".

{note}
Use of custom charsets was deprecated in Solr 1.4 and is unsupported in Solr 3.1. If you need
to index text in these encodings, please use Java's character set conversion facilities ({{InputStreamReader}}
and so on) during I/O, so that Lucene can analyze this text as Unicode instead.
{note}

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.GreekLowerCaseFilterFactory"/>
</analyzer>
{code}

{topofpage}
h3. Hindi

Solr includes support for stemming Hindi following [this algorithm|http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf]
(PDF), support for common spelling differences through the {{solr.HindiNormalizationFilterFactory}},
support for encoding differences through the {{solr.IndicNormalizationFilterFactory}} following
[this algorithm|http://ldc.upenn.edu/myl/IndianScriptsUnicode.html], and Lucene includes an
example stopword list.

*Factory classes:* solr.IndicNormalizationFilterFactory, solr.HindiNormalizationFilterFactory,
solr.HindiStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.IndicNormalizationFilterFactory"/>
  <filter class="solr.HindiNormalizationFilterFactory"/>
  <filter class="solr.HindiStemFilterFactory"/>
</analyzer>
{code}

{topofpage}
h3. Indonesian

Solr includes support for stemming Indonesian (Bahasa Indonesia) following [this algorithm|http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf]
(PDF), and Lucene includes an example stopword list.

*Factory class:* solr.IndonesianStemFilterFactory

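*Arguments:*

{{stemDerivational}}: (true/false) If true, derivational affixes are stemmed in addition to inflectional suffixes.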

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
</analyzer>
{code}

*In:* "sebagai sebagainya"

*Tokenizer to Filter:* "sebagai", "sebagainya"

*Out:* "bagai", "bagai"
{topofpage}
h3. Italian

Solr includes two stemmers for Italian: one in the {{solr.SnowballPorterFilterFactory language="Italian"}},
and a lighter stemmer called {{solr.ItalianLightStemFilterFactory}}. Lucene includes an example
stopword list.

*Factory class:* solr.ItalianLightStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ItalianLightStemFilterFactory"/>
</analyzer>
{code}

*In:* "propaga propagare propagamento"

*Tokenizer to Filter:* "propaga", "propagare", "propagamento"

*Out:* "propag", "propag", "propag"
{topofpage}
h3. Kuromoji (Japanese)

Solr includes support for analyzing Japanese with Kuromoji, a morphological analyzer, and Lucene
includes an example stopword list. Kuromoji has a search mode (the default) that does segmentation
useful for search: a heuristic is used to segment compounds into their parts, and the compound
itself is kept as a synonym.

With Solr 4, the {{JapaneseIterationMarkCharFilterFactory}} is now included to normalize Japanese
iteration marks.

You can also configure whether punctuation is discarded in the {{JapaneseTokenizerFactory}}
by setting {{discardPunctuation}} to {{false}} (to keep punctuation) or {{true}} (to discard
punctuation).

*Factory class:* {{solr.JapaneseTokenizerFactory}}

*Arguments:* 

{{mode}}: Use search-mode to get a noun-decompounding effect useful for search. Search mode
improves segmentation for search at the expense of part-of-speech accuracy. Valid values for
mode are:

* {{normal}}: default segmentation
* {{search}}: segmentation useful for search (extra compound splitting)
* {{extended}}: search mode with unigramming of unknown words (experimental)

For some applications it might be good to use search mode for indexing and normal mode for
queries to reduce recall and prevent parts of compounds from being matched and highlighted.


Kuromoji also has a convenient user dictionary feature that allows overriding the statistical
model with your own entries for segmentation, part-of-speech tags and readings without a need
to specify weights. Note that user dictionaries have not been subject to extensive testing.
User dictionary attributes are:

{{userDictionary}}: user dictionary filename

{{userDictionaryEncoding}}: user dictionary encoding (default is UTF-8)

See {{lang/userdict_ja.txt}} for a sample user dictionary file.

Punctuation characters are discarded by default. Use {{discardPunctuation="false"}} to keep
them.

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
   <analyzer>
      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
      <filter class="solr.JapaneseBaseFormFilterFactory"/>
      <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"
enablePositionIncrements="true"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"
enablePositionIncrements="true" />
      <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>
{code}

{topofpage}
h3. Lao, Myanmar, Khmer

Lucene provides support for segmenting these languages into syllables with the {{solr.ICUTokenizerFactory}}
in the {{analysis-extras}} contrib module. To use this tokenizer, see {{solr/contrib/analysis-extras/README.txt}}
for instructions on which jars you need to add to your {{solr_home/lib}}.
{topofpage}
h3. Latvian

Solr includes support for stemming Latvian, and Lucene includes an example stopword list.

*Factory class:* solr.LatvianStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="text_lvstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.LatvianStemFilterFactory"/>
  </analyzer>
</fieldType>
{code}

*In:* "tirgiem tirgus"

*Tokenizer to Filter:* "tirgiem", "tirgus"

*Out:* "tirg", "tirg"
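
To apply the stopword list together with the stemmer, a stop filter can be placed ahead of it. This is a sketch only: the field type name {{text_lv_stop}} is hypothetical, and the path {{lang/stopwords_lv.txt}} is assumed by analogy with the {{lang/stopwords_*.txt}} files used elsewhere on this page, so check it against your own configuration directory.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<!-- Hypothetical field type name; the stopword file path is an assumption -->
<fieldType name="text_lv_stop" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Remove stopwords before stemming so the list is matched against unstemmed tokens -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_lv.txt"/>
    <filter class="solr.LatvianStemFilterFactory"/>
  </analyzer>
</fieldType>
{code}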
{topofpage}
h3. Norwegian

Solr includes two classes for stemming Norwegian, {{NorwegianLightStemFilterFactory}} and
{{NorwegianMinimalStemFilterFactory}}. Lucene includes an example stopword list.

h4. Norwegian Light Stemmer

The {{NorwegianLightStemFilterFactory}} stems in two passes to handle the \-dom and \-het
endings. In the first pass the word "kristendom" is stemmed to "kristen"; the general rules
are then applied, so it is further stemmed to "krist". As a result, "kristen", "kristendom",
"kristendommen", and "kristendommens" are all stemmed to "krist".

The extra pass exists to pick up words with \-dom and \-het endings. The following table compares
the result of a single pass with the result of two passes:

|| *One pass* || || *Two passes* || ||
| *Before* | *After* | *Before* | *After* |
| forlegen | forleg | forlegen | forleg |
| forlegenhet | forlegen | forlegenhet | forleg |
| forlegenheten | forlegen | forlegenheten | forleg |
| forlegenhetens | forlegen | forlegenhetens | forleg |
| firkantet | firkant | firkantet | firkant |
| firkantethet | firkantet | firkantethet | firkant |
| firkantetheten | firkantet | firkantetheten | firkant |

*Factory class:* {{solr.NorwegianLightStemFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_no.txt"
format="snowball" enablePositionIncrements="true"/>
      <filter class="solr.NorwegianLightStemFilterFactory"/>
   </analyzer>
</fieldType>
{code}

*In:* "Forelskelsen"

*Tokenizer to Filter:* "forelskelsen"

*Out:* "forelske"

h4. Norwegian Minimal Stemmer

The {{NorwegianMinimalStemFilterFactory}} stems plural forms of Norwegian nouns only.

*Factory class:* {{solr.NorwegianMinimalStemFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_no.txt"
format="snowball" enablePositionIncrements="true"/>
      <filter class="solr.NorwegianMinimalStemFilterFactory"/>
   </analyzer>
</fieldType>
{code}

*In:* "Bilens"

*Tokenizer to Filter:* "bilens"

*Out:* "bil"

{topofpage}
h3. Persian

h4. Persian Filter Factories

Solr includes support for normalizing Persian, and Lucene includes an example stopword list.

*Factory class:* solr.PersianNormalizationFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ArabicNormalizationFilterFactory"/>
  <filter class="solr.PersianNormalizationFilterFactory"/>
</analyzer>
{code}


{topofpage}
h3. Polish

Solr provides support for Polish stemming with the {{solr.StempelPolishStemFilterFactory}}
in the {{contrib/analysis-extras}} module. This component includes an algorithmic stemmer
with tables for Polish. To use this filter, see {{solr/contrib/analysis-extras/README.txt}}
for instructions on which jars you need to add to your {{solr_home/lib}}.

*Factory class:* {{solr.StempelPolishStemFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StempelPolishStemFilterFactory"/>
</analyzer>
{code}

*In:* "studenta studenci"

*Tokenizer to Filter:* "studenta", "studenci"

*Out:* "student", "student"

More information about the Stempel stemmer is available in the Lucene javadocs, [https://lucene.apache.org/core/4_0_0/analyzers-stempel/index.html].

{topofpage}
h3. Portuguese

Solr includes four stemmers for Portuguese: one in the {{solr.SnowballPorterFilterFactory}},
an alternative stemmer called {{solr.PortugueseStemFilterFactory}}, a lighter stemmer called
{{solr.PortugueseLightStemFilterFactory}}, and an even less aggressive stemmer called {{solr.PortugueseMinimalStemFilterFactory}}.
Lucene includes an example stopword list.

*Factory classes:* {{solr.PortugueseStemFilterFactory}}, {{solr.PortugueseLightStemFilterFactory}},
{{solr.PortugueseMinimalStemFilterFactory}}

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PortugueseStemFilterFactory"/>
</analyzer>
{code}

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PortugueseLightStemFilterFactory"/>
</analyzer>
{code}

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PortugueseMinimalStemFilterFactory"/>
</analyzer>
{code}

*In:* "praia praias"

*Tokenizer to Filter:* "praia", "praias"

*Out:* "pra", "pra"
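
The three examples above cover the dedicated Portuguese factories. The fourth option, the Snowball stemmer, could be configured as in the sketch below. Its output for the sample input is not documented here and may differ from the output shown above.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Snowball stemmer selected through the language attribute -->
  <filter class="solr.SnowballPorterFilterFactory" language="Portuguese"/>
</analyzer>
{code}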
{topofpage}
h3. Russian

h4. Russian Letter Tokenizer

This tokenizer breaks Russian language text into tokens. It is similar to LetterTokenizer,
but additionally looks up letters in the appropriate Russian character set.

*Factory class:* solr.RussianLetterTokenizerFactory

*Arguments:*

{{charset}}: (optional, default "UnicodeRussian") The name of the character set to use.  Must
be "UnicodeRussian", "KOI8" or "CP1251".

{note}
Use of custom charsets was deprecated in Solr 1.4 and is unsupported in Solr 3.1. If you need
to index text in these encodings, use Java's character set conversion facilities ({{InputStreamReader}}
and so on) during I/O, so that Lucene can analyze the text as Unicode instead.
{note}

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.RussianLetterTokenizerFactory"/>
</analyzer>
{code}

h4. Russian Lower Case Filter

This filter converts uppercase letters in the Russian character set to the equivalent lowercase
character.

*Factory class:* solr.RussianLowerCaseFilterFactory

*Arguments:*

{{charset}}: (optional, default "UnicodeRussian") Specifies the name of the character set
to use. Must be "UnicodeRussian", "KOI8" or "CP1251".

{note}
Use of custom charsets was deprecated in Solr 1.4 and is unsupported in Solr 3.1. If you need
to index text in these encodings, use Java's character set conversion facilities ({{InputStreamReader}}
and so on) during I/O, so that Lucene can analyze the text as Unicode instead.
{note}

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.RussianLowerCaseFilterFactory"/>
</analyzer>
{code}

h4. Russian Stem Filter

Solr includes two stemmers for Russian: one in the {{solr.SnowballPorterFilterFactory language="Russian"}},
and a lighter stemmer called {{solr.RussianLightStemFilterFactory}}. Lucene includes an example
stopword list.

*Factory class:* solr.RussianLightStemFilterFactory

*Arguments:*

{{charset}}: (optional, default "UnicodeRussian") Specifies the name of the character set to use.
Must be "UnicodeRussian", "KOI8" or "CP1251".

{note}
Use of custom charsets was deprecated in Solr 1.4 and is unsupported in Solr 3.4. If you need
to index text in these encodings, use Java's character set conversion facilities ({{InputStreamReader}}
and so on) during I/O, so that Lucene can analyze the text as Unicode instead.
{note}

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.RussianLowerCaseFilterFactory"/>
  <filter class="solr.RussianLightStemFilterFactory"/>
</analyzer>
{code}
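
For comparison, the Snowball alternative mentioned at the start of this section might be set up as follows. This is a sketch, not a tested configuration: it uses the generic {{solr.LowerCaseFilterFactory}} in place of the Russian-specific filter, which is an assumption about what suits your input.

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Generic lowercasing is an assumption here; swap in the Russian filter if you need it -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
</analyzer>
{code}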

{topofpage}
h3. Spanish

Solr includes two stemmers for Spanish: one in the {{solr.SnowballPorterFilterFactory language="Spanish"}},
and a lighter stemmer called {{solr.SpanishLightStemFilterFactory}}. Lucene includes an example
stopword list.

*Factory class:* solr.SpanishLightStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SpanishLightStemFilterFactory"/>
</analyzer>
{code}

*In:* "torear toreara torearlo"

*Tokenizer to Filter:* "torear", "toreara", "torearlo"

*Out:* "tor", "tor", "tor"
{topofpage}
h3. Swedish

h4. Swedish Stem Filter

Solr includes two stemmers for Swedish: one in the {{solr.SnowballPorterFilterFactory language="Swedish"}},
and a lighter stemmer called {{solr.SwedishLightStemFilterFactory}}. Lucene includes an example
stopword list.

*Factory class:* solr.SwedishLightStemFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SwedishLightStemFilterFactory"/>
</analyzer>
{code}

*In:* "kloke klokhet klokheten"

*Tokenizer to Filter:* "kloke", "klokhet", "klokheten"

*Out:* "klok", "klok", "klok"
{topofpage}
h3. Thai

This filter converts sequences of Thai characters into individual Thai words. Unlike European
languages, Thai does not use whitespace to delimit words.

*Factory class:* solr.ThaiWordFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ThaiWordFilterFactory"/>
</analyzer>
{code}

{topofpage}
h3. Turkish

Solr includes support for stemming Turkish through the {{solr.SnowballPorterFilterFactory}},
as well as support for case-insensitive search through the {{solr.TurkishLowerCaseFilterFactory}},
and Lucene includes an example stopword list.

*Factory class:* solr.TurkishLowerCaseFilterFactory

*Arguments:* None

*Example:*

{code:lang=xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.TurkishLowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
</analyzer>
{code}
{topofpage}
h2. Related Topics

* [LanguageAnalysis|http://wiki.apache.org/solr/LanguageAnalysis]


{scrollbar}

