lucene-solr-user mailing list archives

From Poornima Jay <poornima...@rocketmail.com>
Subject Re: Korean Tokenizer in solr
Date Mon, 14 Jul 2014 09:38:28 GMT
When I try to index, I get the error below:

java.io.FileNotFoundException: /home/searchuser/multicore/apac_content/data/tlog/tlog.0000000000000000000 (No such file or directory)





On Monday, 14 July 2014 2:07 PM, Poornima Jay <poornimajay@rocketmail.com> wrote:
 


Yes. Below is the fieldType I defined:

<fieldType name="text_match_phrase_cjk" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
   </analyzer>
</fieldType>

Please correct me if I am doing anything wrong here.

Regards,
Poornima
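
For comparison, here is a minimal sketch of a bigram analyzer that uses only the parameters CJKBigramFilterFactory accepts (han, hiragana, katakana, hangul, outputUnigrams). It assumes that the indexUnigrams attribute in the message above was meant to be outputUnigrams, and that the ICU analysis jars from contrib/analysis-extras are on the classpath. Neither assumption is confirmed anywhere in this thread. The field type name is hypothetical. In Solr 4.8 the analysis factories throw on parameters they do not recognize, so an unrecognized attribute is one possible cause of an "Error instantiating class" message.

```xml
<!-- Sketch only: "text_cjk_bigram_sketch" is a hypothetical name, not taken from the thread. -->
<fieldType name="text_cjk_bigram_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICUTokenizerFactory lives in the analysis-extras contrib, so its jars must be loadable. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Normalize full-width and half-width forms before forming bigrams. -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- outputUnigrams="true" emits single characters in addition to bigrams.
         Assumption: this is what indexUnigrams="true" was intended to do. -->
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Using one analyzer for both index and query keeps the two sides producing the same bigrams. If separate index and query analyzers are needed, the same filter attributes would apply to both.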



On Monday, 14 July 2014 12:33 PM, Alexandre Rafalovitch <arafalov@gmail.com> wrote:



Are you sure it's not a spelling error or something else weird like
that? Solr ships with that filter in its example schema:
        <filter class="solr.CJKBigramFilterFactory"/>

So you can compare that with what you are doing differently.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853



On Mon, Jul 14, 2014 at 1:58 PM, Poornima Jay
<poornimajay@rocketmail.com> wrote:
> I have upgraded Solr to version 4.8.1, but after making changes in the schema file I am getting the error below:
> Error instantiating class: 'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
> I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 4.8.1. Do I need to make any configuration changes to get this working?
>
> Please advise.
>
> Regards,
> Poornima
>
>
> On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
>
>
>
> I would suggest you read through all 12 (?) articles in this series:
> http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
> It will probably lay out most of the issues for you.
>
> And if you are just starting, I would really suggest using the latest Solr
> (4.9). A lot more people remember what the latest version has than
> what was in 3.6. And, as the series above will tell you, some relevant
> issues have been fixed in more recent Solr versions.
>
> Regards,
>    Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>
>
>
> On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay
> <poornimajay@rocketmail.com> wrote:
>> Until now I thought Solr supported KoreanTokenizer. I haven't used any third-party one.
>> The issue I am facing is that I need to integrate English, Chinese, Japanese and Korean language search in a single site. Based on the language the user selects, the appropriate fields will be queried.
>>
>> I tried using CJK for all three languages, as below, but only a few search terms work for Chinese and Japanese. Nothing works for Korean.
>>
>> <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>      <analyzer>
>>         <tokenizer class="solr.CJKTokenizerFactory" />
>>         <filter class="solr.CJKWidthFilterFactory"/>
>>         <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>         <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>         <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>         <filter class="solr.ICUFoldingFilterFactory"/>
>>         <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>       </analyzer>
>>     </fieldtype>
>>
>> So I tried to implement an individual fieldType for each language, as below:
>>
>> Chinese
>>  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>      <analyzer>
>>          <tokenizer class="solr.ICUTokenizerFactory"/>
>>            <filter class="solr.ICUFoldingFilterFactory"/>
>>            <filter class="solr.CJKWidthFilterFactory"/>
>>            <filter class="solr.CJKBigramFilterFactory"/>
>>        </analyzer>
>>     </fieldType>
>>
>> Japanese
>> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>>    <analyzer>
>>      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>>       <filter class="solr.JapaneseBaseFormFilterFactory"/>
>>       <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" />
>>       <filter class="solr.CJKWidthFilterFactory"/>
>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" />
>>       <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>    </analyzer>
>> </fieldType>
>>
>> Korean
>> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>       <analyzer type="index">
>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>         <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>       </analyzer>
>>       <analyzer type="query">
>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>         <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>       </analyzer>
>>     </fieldType>
>>
>> I am really stuck on how to implement this. Please help me.
>>
>> Thanks,
>> Poornima
>>
>>
>>
>> On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
>>
>>
>>
>> I don't think Solr ships with a Korean tokenizer, does it?
>>
>> If you are using a third-party one, you need to give the full class name,
>> not just solr.Korean... And you need the library added in the lib
>> statement in solrconfig.xml (at least in Solr 4).
>>
>> Regards,
>>    Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>
>>
>>
>> On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay
>> <poornimajay@rocketmail.com> wrote:
>>> I have defined the fieldType inside the fields section. When I checked the error log I found the error below:
>>>
>>> Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>>>
>>> SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
>>>
>>>
>>> Do I need to add any libraries for KoreanTokenizer?
>>>
>>> Regards,
>>> Poornima
>>>
>>>
>>> On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
>>>
>>>
>>>
>>> Double-check your XML file to make sure you don't, for example, define your
>>> fieldType outside of the fields section. Or maybe there is an earlier
>>> exception about some component in the type definition.
>>>
>>> This does not seem to be about the Korean language. It looks like something
>>> more fundamental in the XML config.
>>>
>>> Regards,
>>>    Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
>>>
>>>
>>>
>>> On Thu, Jul 10, 2014 at 2:26 PM, Poornima Jay
>>> <poornimajay@rocketmail.com> wrote:
>>>> Hi,
>>>>
>>>> Has anyone tried to implement Korean language support in Solr 3.6.1? I defined
>>>> the field type as below in my schema file, but the fieldType is not working.
>>>>
>>>> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000">
>>>>       <analyzer type="index">
>>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>>>       </analyzer>
>>>>       <analyzer type="query">
>>>>         <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>>         <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>>>       </analyzer>
>>>>     </fieldType>
>>>>
>>>> Error: Caused by: org.apache.solr.common.SolrException: Unknown fieldtype 'text_kr' specified on field product_name_kr
>>>>
>>>> Regards,
>>>> Poornima
>>>>