lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "LanguageAnalysis" by JanHoydahl
Date Sun, 06 Nov 2011 21:16:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageAnalysis" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=16&rev2=17

Comment:
Example for SmartChineseSentenceTokenizerFactory

  === Chinese, Japanese, Korean ===
  Lucene provides support for these languages with CJKTokenizer, which indexes bigrams and
does some character folding of full-width forms.
  
+ {{{
+    <tokenizer class="solr.CJKTokenizerFactory"/>
+ ...
+ }}}
+ 
  <!> [[Solr3.1]] Alternatively, for Simplified Chinese, Solr provides Chinese word
segmentation with {{{solr.SmartChineseWordTokenFilterFactory}}} in the analysis-extras contrib
module. This component includes a large dictionary and segments Chinese text into words using
a Hidden Markov Model. To use this filter, see solr/contrib/analysis-extras/README.txt for
instructions on which jars you need to add to your SOLR_HOME/lib.
  
+ To use the default setup, which falls back to the English Porter stemmer for English words, use:
  {{{
+    <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
+ }}}
+ 
+ Or, to configure your own analysis setup, use the SmartChineseSentenceTokenizerFactory along
with your custom filter setup. The sentence tokenizer splits text on sentence boundaries, and
the SmartChineseWordTokenFilter then breaks each sentence into words.
+ {{{
+   <analyzer>
-    <tokenizer class="solr.CJKTokenizerFactory"/>
+     <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
- ...
+     <filter class="solr.SmartChineseWordTokenFilterFactory"/>
+     <filter class="solr.LowerCaseFilterFactory"/>
+     <filter class="solr.PositionFilterFactory" />
+   </analyzer>
  }}}
  
  <!> Note: Be sure to use [[AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory|PositionFilter]]
at query-time (only) as these languages do not use spaces between words. 
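
As a rough sketch of the query-time-only advice in the note above, the index and query analyzers can be declared separately so that PositionFilter appears only in the query chain. The field type name {{{text_zh_example}}} and the {{{positionIncrementGap}}} value are illustrative assumptions, not taken from this page; the factory class names are the ones shown above, and the analysis-extras jars are assumed to be available as described in the README.

{{{
<!-- Hypothetical field type: the name and positionIncrementGap are placeholders -->
<fieldType name="text_zh_example" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: split into sentences, then words, then lowercase; no PositionFilter -->
  <analyzer type="index">
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: same chain, with PositionFilter added as the last filter -->
  <analyzer type="query">
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>
}}}

Keeping the two chains otherwise identical means index-time and query-time tokens should match. As far as the filter's documented behaviour goes, PositionFilter places the words segmented from one unspaced query string at the same position, so they are not automatically turned into a phrase query.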
