lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From waynelam <wayne...@ln.edu.hk>
Subject Re: Searching of Chinese characters and English
Date Thu, 06 Sep 2012 04:45:53 GMT
Thank you Lance.
I just found out the problem, in case somebody came across this.
It turn out to be the problem that tomcat is not accepting UTF-8 in URL 
by default.

http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config

I have no idea why it is the case but after i follow the instruction in 
the document above.

Problem solved!!

Thanks so much for your help!


Wayne


On 6/9/2012 11:19, Lance Norskog wrote:
> I believe that you should remove the Analyzer class name from the field type. I think
it overrides the stacks of tokenizer/tokenfilter. Other <fieldType> declarations do
not have an Analyzer class and Tokenizers.
>   <analyzer type="index" class="org.apache.lucene.analysis.cjk.CJKAnalyzer">
> should be:
>   <analyzer type="index">
>
> This may not help with your searching problem.
>
> ----- Original Message -----
> | From: "waynelam" <waynelam@ln.edu.hk>
> | To: solr-user@lucene.apache.org
> | Sent: Wednesday, September 5, 2012 8:07:36 PM
> | Subject: Re: Searching of Chinese characters and English
> |
> | Any thoughts?
> |
> | It is weird, i can see the words are cutting correctly in Field
> | Analysis. I checked almost every website that they are telling either
> | CJKAnalyzer, IKAnalyzer or SmartChineseAnalyzer. But if i can see the
> | words are cutting then it should not be the problem of settings of
> | different Analyzer. Am I correct?
> |
> | Anyone have an idea or hints?
> |
> | Thanks so much
> |
> | Wayne
> |
> |
> |
> | On 4/9/2012 13:03, waynelam wrote:
> | > Hi all,
> | >
> | > I tried to modified the schema.xml and solrconfig.xml come with
> | > Drupal
> | > "search_api_solr" modules. I tried to modified it so that it is
> | > suitable for an CJK environment. I can see Chinese words cut up
> | > each 2
> | > words in "Field Analysis". If i use the following query
> | >
> | > my_ip_address:8080/solr/select?indent=on&version=2.2&fq=t_title:"Find"&start=0&rows=10&fl=t_title
> | >
> | >
> | > I can see it returning results. The problem is when i change the
> | > search keywords for one of my field (e.g. t_title) to Chinese
> | > characters. It always shows
> | >
> | > <result name="response" numFound="0" start="0"/>
> | >
> | > in the results. It is strange because if a title contains both
> | > chinese
> | > and english (e.g. testing ??), when i search just the english part
> | > (e.g. fq=t_title:"testing"), i can find the result perfectly. It
> | > just
> | > happened to be problem when searching chinese characters.
> | >
> | >
> | > Much appreciated if you guys can show me which part i did wrong.
> | >
> | > Thanks
> | >
> | > Wayne
> | >
> | > *My Settings:*
> | > Java : 1.6.0_24
> | > Solr : version 3.6.1
> | > tomcat: version 6.0.35
> | >
> | > *My schema.xml* (i highlighted the place i changed from default)
> | >
> | > *<fieldType name="text" class="solr.TextField" indexed="true"
> | > stored="true" multiValued="true">**
> | > **      <analyzer type="index"
> | > class="org.apache.lucene.analysis.cjk.CJKAnalyzer">**
> | > **        <tokenizer
> | > class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>**
> | > **        <filter class="solr.WordDelimiterFilterFactory"
> | > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> | > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>**
> | > **        <filter class="solr.LowerCaseFilterFactory"/>**
> | > **        <filter class="solr.SnowballPorterFilterFactory"
> | > language="English" protected="protwords.txt"/>**
> | > **        <filter
> | > class="solr.RemoveDuplicatesTokenFilterFactory"/>**
> | > **        <filter class="schema.UnicodeNormalizationFilterFactory"
> | > version="icu4j" composed="false" remove_diacritics="true"
> | > remove_modifiers="true" fold="true"/>**
> | > **        <filter class="solr.ISOLatin1AccentFilterFactory"/>**
> | > **      </analyzer>**
> | > **      <analyzer type="query"
> | > class="org.apache.lucene.analysis.cjk.CJKAnalyzer">**
> | > **        <tokenizer
> | > class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>**
> | > **        <filter class="solr.WordDelimiterFilterFactory"
> | > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> | > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>**
> | > **        <filter class="solr.LowerCaseFilterFactory"/>**
> | > **        <filter class="solr.SnowballPorterFilterFactory"
> | > language="English" protected="protwords.txt"/>**
> | > **        <filter
> | > class="solr.RemoveDuplicatesTokenFilterFactory"/>**
> | > **        <filter class="schema.UnicodeNormalizationFilterFactory"
> | > version="icu4j" composed="false" remove_diacritics="true"
> | > remove_modifiers="true" fold="true"/>**
> | > **        <filter class="solr.ISOLatin1AccentFilterFactory"/>**
> | > **      </analyzer>**
> | > **    </fieldType>*
> | >
> | >     <fieldType name="sortString" class="solr.TextField"
> | >     indexed="true"
> | > stored="true" sortMissingLast="true" omitNorms="true">
> | >       <analyzer>
> | >
> | >         <tokenizer class="solr.KeywordTokenizerFactory"/>
> | >
> | >         <filter class="solr.LowerCaseFilterFactory" />
> | >         <filter class="solr.TrimFilterFactory" />
> | >       </analyzer>
> | >     </fieldType>
> | >
> | >     <fieldType name="rand" class="solr.RandomSortField"
> | >     indexed="true" />
> | >
> | >     <fieldtype name="ignored" stored="true" indexed="false"
> | > class="solr.StrField" />
> | >  </types>
> | >  <fields>
> | >
> | >    <field name="id"       type="string" indexed="true"
> | >    stored="true"
> | > required="true" />
> | >    <field name="item_id"  type="string" indexed="true"
> | >    stored="true"
> | > required="true" />
> | >    <field name="index_id" type="string" indexed="true"
> | >    stored="true"
> | > required="true" />
> | >
> | >    <copyField source="item_id" dest="ss_search_api_id" />
> | >    <field name="spell" type="textSpell" indexed="true"
> | >    stored="true"
> | > multiValued="true"/>
> | >    <copyField source="t_*" dest="spell"/>
> | >
> | > *<field name="t_title" type="text" indexed="true" stored="true"
> | > autoGeneratePhraseQueries="false"/>*
> | >    <dynamicField name="t_*" type="text" termVectors="true" />
> | >    <dynamicField name="ss_*" type="sortString" multiValued="false"
> | > termVectors="true" />
> | >    <dynamicField name="sm_*" type="sortString" multiValued="true"
> | > termVectors="true" />
> | >    <dynamicField name="is_*" type="tlong" multiValued="false"
> | > termVectors="true" />
> | >    <dynamicField name="im_*" type="long" multiValued="true"
> | > termVectors="true" />
> | >    <dynamicField name="fs_*" type="tdouble" multiValued="false"
> | > termVectors="true" />
> | >    <dynamicField name="fm_*" type="tdouble" multiValued="true"
> | > termVectors="true" />
> | >    <dynamicField name="ds_*" type="tdate" multiValued="false"
> | > termVectors="true" />
> | >    <dynamicField name="dm_*" type="tdate" multiValued="true"
> | > termVectors="true" />
> | >    <dynamicField name="bs_*" type="boolean" multiValued="false"
> | > termVectors="true" />
> | >    <dynamicField name="bm_*" type="boolean" multiValued="true"
> | > termVectors="true" />
> | >    <dynamicField name="f_ss_*" type="string" multiValued="false"
> | > termVectors="true" />
> | >    <dynamicField name="f_sm_*" type="string" multiValued="true"
> | > termVectors="true" />
> | >    <copyField source="ss_*" dest="f_ss_*" />
> | >    <copyField source="sm_*" dest="f_sm_*" />
> | >    <dynamicField name="*" type="ignored" multiValued="true" />
> | >  </fields>
> | >
> | >  <uniqueKey>id</uniqueKey>
> | >  <solrQueryParser defaultOperator="AND"/>
> | >
> | > </schema>
> | >
> |
> |
> | --
> | -----------------------------------------
> | Wayne Lam
> | Assistant Librarian II
> | Systems Development & Support
> | Fong Sum Wood Library
> | Lingnan University
> | 8 Castle Peak Road
> | Tuen Mun, New Territories
> | Hong Kong SAR
> | China
> | Phone:   +852 26168576
> | Email:   waynelam@ln.edu.hk
> | Website: http://www.library.ln.edu.hk
> |
> |


-- 
-----------------------------------------
Wayne Lam
Assistant Librarian II
Systems Development & Support
Fong Sum Wood Library
Lingnan University
8 Castle Peak Road
Tuen Mun, New Territories
Hong Kong SAR
China
Phone:   +852 26168576
Email:   waynelam@ln.edu.hk
Website: http://www.library.ln.edu.hk


Mime
View raw message