lucene-solr-user mailing list archives

From Paul Libbrecht <p...@hoplahup.net>
Subject Re: Language specific tokenizer for purpose of multilingual search in single-core solr,
Date Tue, 14 Feb 2012 08:45:24 GMT
Only one field element?
There should be two, shouldn't there?
One for each language.
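
As a rough sketch of what I mean (reusing the field and type names from your message, and assuming both fieldTypes live in the same schema.xml), the two field elements would look something like this:

```xml
<!-- Sketch only: one field per language, each bound to its own fieldType. -->
<!-- Each type="..." value must match the name="..." of a fieldType defined in this schema. -->
<field name="text_en" type="text_en" indexed="true" stored="false" multiValued="true"/>
<field name="text_zh_cn" type="text_zh_cn" indexed="true" stored="false" multiValued="true"/>
```

Each document then puts its English text into text_en and its Chinese text into text_zh_cn, so each gets analyzed by its own tokenizer chain.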

paul


On 14 Feb 2012, at 07:34, bing wrote:

> 
> Hi, all, 
> 
> I want to do multilingual search in a single-core Solr. That requires
> defining language-specific tokenizers in schema.xml. Say, for example, I have
> two tokenizers, one for English ("en") and one for Simplified Chinese
> ("zh-cn"). Can I just put the following definitions together in one schema.xml,
> and both sets of files (stopwords, synonyms, and protwords) in one
> directory?
> 
> 
> 1. fieldType and field definition for English ("en")
> 
> <fieldType name="text_en" class="solr.TextField"
> positionIncrementGap="100">
>  <analyzer type="index" language="en">
>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>    <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_en.txt" enablePositionIncrements="true" />
>    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.SnowballPorterFilterFactory" 
> protected="protwords_en.txt"/>
>  </analyzer>
>  .....
> </fieldType>
> 
> <field name="text_en" type="text_en" indexed="true" stored="false"
> multiValued="true"/>
> 
> 
> 2. fieldType and field definition for Chinese ("zh_cn")  
> 
> <fieldType name="text_zh_cn" class="solr.TextField"
> positionIncrementGap="100">
>  <analyzer type="index" language="zh_cn">
>    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory"/>
>    <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_ch.txt" enablePositionIncrements="true" />
>    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.SnowballPorterFilterFactory" 
> protected="protwords_en.txt"/>
>  </analyzer>
>  .....
> </fieldType>
> 
> <field name="text_zh_cn" type="text_zh_cn" indexed="true" stored="false"
> multiValued="true"/>
> 
> 
> Best 
> Bing
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Language-specific-tokenizer-for-purpose-of-multilingual-search-in-single-core-solr-tp3742873p3742873.html
> Sent from the Solr - User mailing list archive at Nabble.com.

