lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by RobertMuir
Date Fri, 25 Feb 2011 06:24:35 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by RobertMuir.
The comment on this change is: add docs for icu analysis factories.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=109&rev2=110

--------------------------------------------------

      </analyzer>
    </fieldType>
  }}}
+ 
+ === solr.ICUTokenizerFactory ===
+ <!> [[Solr3.1]] Uses [[http://site.icu-project.org/|ICU]]'s text bounds capabilities
to tokenize text.
+ 
+ This tokenizer first identifies the writing system (the "Script") of each run of text within the document. It then tokenizes
+ each run according to rules or dictionaries appropriate to that writing system. For example, if it encounters
+ Thai, it applies dictionary-based segmentation to split the Thai text, because Thai uses no spaces between words.
+ 
+ ||'''Input String'''||'''Output Tokens'''||'''Script Attribute'''||
+ ||Testing บริษัทชื่อ נאסק"ר||Testing<<BR>>บริษัท<<BR>>ชื่อ<<BR>>נאסק"ר||Latin<<BR>>Thai<<BR>>Thai<<BR>>Hebrew||
+ 
+ {{{
+     <fieldType name="text_icu" class="solr.TextField" autoGeneratePhraseQueries="false">
+       <analyzer>
+         <tokenizer class="solr.ICUTokenizerFactory"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ Note: to use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
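
The first step described above, splitting text into runs by writing system, can be illustrated with a rough sketch. It is not ICU and not the tokenizer's actual implementation: it uses Python's standard `unicodedata` module and, as a stated assumption, treats the first word of each character's Unicode name (`LATIN`, `THAI`, `HEBREW`, ...) as a stand-in for the real Script property. The helper names `script_of` and `script_runs` are hypothetical. The second step (dictionary-based segmentation of the Thai run into words) is not attempted here.

```python
import unicodedata


def script_of(ch):
    """Rough script label for one character.

    Assumption: the first word of the Unicode character name is used as a
    proxy for the Script property. ICU uses real script data instead.
    """
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Character has no name in the Unicode database.
        return "UNKNOWN"


def script_runs(text):
    """Split text on whitespace, then split each piece where the script changes."""
    runs = []
    for word in text.split():
        current_script, current = None, ""
        for ch in word:
            script = script_of(ch)
            if current and script != current_script:
                # Script changed inside the word: close the previous run.
                runs.append((current_script, current))
                current = ""
            current_script = script
            current += ch
        if current:
            runs.append((current_script, current))
    return runs


thai = "บริษัทชื่อ"
runs = script_runs("Testing " + thai)
```

Here `runs` holds one Latin run and one Thai run. A real tokenizer would go on to segment the Thai run with a dictionary, which is what produces the two Thai tokens in the table above.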
  
  == TokenFilterFactories ==
  
@@ -699, +719 @@

  <<Anchor(CollationKeyFilterFactory)>>
  
  === solr.CollationKeyFilterFactory ===
- <!> [[Solr1.5]]
+ <!> [[Solr3.1]]
  
  A filter that lets one specify:
  
@@ -715, +735 @@

   1. [[http://lucene.apache.org/java/2_9_1/api/contrib-collation/org/apache/lucene/collation/CollationKeyFilter.html|Lucene's
CollationKeyFilter javadocs]]
   1. UnicodeCollation
  
+ === solr.ICUCollationKeyFilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter works like CollationKeyFilterFactory, except that it uses ICU for collation. ICU produces smaller sort keys, generates them faster, and supports more locales. See UnicodeCollation for more information; the same concepts apply.
+ 
+ The only configuration difference is that locales should be specified to this filter with
RFC 3066 locale IDs.
+ 
+ {{{
+     <fieldType name="icu_sort_en" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.KeywordTokenizerFactory"/>
+         <filter class="solr.ICUCollationKeyFilterFactory" locale="en" strength="primary"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ === solr.ICUNormalizer2FilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter normalizes text to a [[http://unicode.org/reports/tr15/|Unicode Normalization
Form]].
+ 
+ {{{
+     <fieldType name="normalized" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.StandardTokenizerFactory"/>
+         <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ These are the supported normalization forms: 
+ {{{
+ NFC: name="nfc" mode="compose"
+ NFD: name="nfc" mode="decompose"
+ NFKC: name="nfkc" mode="compose"
+ NFKD: name="nfkc" mode="decompose"
+ NFKC_Casefold: name="nfkc_cf" mode="compose"
+ }}}
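
To make the difference between the four standard forms concrete, here is a small sketch using Python's standard `unicodedata` module rather than ICU (an assumption made so the example runs without extra jars; the results for these forms are defined by the Unicode standard, not by the library). The helper name `forms` is hypothetical.

```python
import unicodedata


def forms(text):
    """Return the text in each of the four standard normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


# "e" followed by a combining acute accent, then the "fi" ligature (U+FB01).
sample = "e\u0301\ufb01"
result = forms(sample)

# Canonical forms (NFC/NFD) compose or decompose the accent but keep the ligature.
# Compatibility forms (NFKC/NFKD) also replace the ligature with plain "fi".
```

The compose/decompose distinction corresponds to the `mode` attribute above, and the canonical/compatibility distinction corresponds to `name="nfc"` versus `name="nfkc"`.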
+ 
+ NFKC_Casefold (nfkc_cf) applies the Unicode Case-Folding algorithm in conjunction with NFKC normalization. Unicode Case-Folding is more than lowercasing: for example, it handles cases like ß/SS. Behind the scenes this is its own form (nfkc_cf): both algorithms have been recursively computed across all of Unicode offline, so that it's an efficient single-pass algorithm.
+ For practical purposes this means you can use this factory with nfkc_cf as a better substitute for the combined behavior of LowerCaseFilter and NFKC normalization.
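
The difference between lowercasing and case folding can be seen in a short sketch. It uses Python's standard library, not ICU, and the function `nfkc_casefold_approx` is a hypothetical approximation: it chains NFKC and case folding in separate passes, whereas nfkc_cf is a single precomputed mapping that also handles details this sketch ignores.

```python
import unicodedata


def nfkc_casefold_approx(text):
    """Approximate NFKC_Casefold as NFKC, then case folding, then NFKC again.

    Assumption: this multi-pass chain is only an illustration and is not
    guaranteed to match ICU's nfkc_cf output for every input.
    """
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


lowered = "Straße".lower()                 # lowercasing leaves ß untouched
folded = nfkc_casefold_approx("Straße")    # case folding expands ß to "ss"
fullwidth = nfkc_casefold_approx("\uff26\uff35\uff2c\uff2c")  # fullwidth "FULL"
```

With folding, a query for "STRASSE" and an indexed "Straße" reduce to the same term, which plain lowercasing does not achieve.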
+ 
+ If you want to do more advanced normalization (e.g. apply a filter to work only on a subset
of Unicode), see the javadocs.
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ === solr.ICUFoldingFilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter applies a custom Unicode normalization form that performs the foldings specified in [[http://www.unicode.org/reports/tr30/tr30-4.html|UTR#30]] in addition to NFKC_Casefold.
+ 
+ {{{
+     <fieldType name="folded" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.StandardTokenizerFactory"/>
+         <filter class="solr.ICUFoldingFilterFactory"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ This means NFKC normalization, Unicode case folding, and search term folding (removing accents, etc.) have been recursively computed across all of Unicode offline, so that it's an efficient single pass through the string.
+ For practical purposes this means you can use this factory as a better substitute for the combined behavior of ASCIIFoldingFilter, LowerCaseFilter, and ICUNormalizer2Filter.
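
As a rough illustration of what this combined folding does to a term, the sketch below uses Python's standard library, not ICU. The function `fold_approx` is hypothetical and covers only case folding, compatibility normalization, and removal of combining accents; the UTR#30 foldings applied by the real filter are broader, and the filter does everything in one pass rather than several.

```python
import unicodedata


def fold_approx(text):
    """Very rough approximation of search-term folding.

    Assumption: accents are removed by decomposing (NFKD) and dropping
    nonspacing marks (category "Mn"). This is not the UTR#30 data ICU uses.
    """
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFKC", stripped)


resume = fold_approx("Résumé")
street = fold_approx("Straße")
```

Both the indexed text and the query would pass through the same folding, so "resume" matches "Résumé" regardless of case or accents.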
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ === solr.ICUTransformFilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter applies [[http://userguide.icu-project.org/transforms/general|ICU Transforms]]
to text.
+ 
+ Currently the filter supports only System transforms (or compounds of them); custom rulesets are not yet supported.
+ 
+ {{{
+     <fieldType name="transformed" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.StandardTokenizerFactory"/>
+         <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ You can see a list of the supported System transforms by going to [[http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html|this
link]], clicking the drop-down, and scrolling down to System.
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
