Wunder - are you aware of any free dictionaries for either C or J or K? When I dealt with
this in the past, I looked for something free, but found only commercial dictionaries.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Walter Underwood <wunderwood@netflix.com>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 28, 2007 5:43:32 PM
Subject: Re: CJK Analyzers for Solr
With Ultraseek, we switched to a dictionary-based segmenter for Chinese
because the N-gram highlighting wasn't acceptable to our Chinese
customers.
I guess it is something to check for each application.
wunder
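
For illustration, here is a minimal Python sketch of overlapping-bigram
tokenization and of why highlighting built on it can look odd. It is an
assumption about the kind of mismatch described above, not Ultraseek's or
Lucene's actual code; the names `bigrams` and `highlight_spans` are
hypothetical.

```python
def bigrams(text):
    """Return overlapping character bigrams as (term, start, end) tuples.

    Assumes `text` is an unbroken run of CJK characters (no whitespace
    or punctuation handling, which a real analyzer would need).
    """
    if len(text) < 2:
        return [(text, 0, len(text))] if text else []
    return [(text[i:i + 2], i, i + 2) for i in range(len(text) - 1)]


def highlight_spans(text, query):
    """Return merged (start, end) offsets of text bigrams found in the query."""
    query_terms = {term for term, _, _ in bigrams(query)}
    spans = []
    for term, start, end in bigrams(text):
        if term not in query_terms:
            continue
        if spans and start <= spans[-1][1]:
            # Overlapping or adjacent match: extend the previous span.
            spans[-1] = (spans[-1][0], max(end, spans[-1][1]))
        else:
            spans.append((start, end))
    return spans


if __name__ == "__main__":
    text = "中华人民共和国"   # segments as 中华 / 人民 / 共和国
    query = "华人"            # a different word that straddles a boundary
    for start, end in highlight_spans(text, query):
        print(text[:start] + "[" + text[start:end] + "]" + text[end:])
```

The query bigram matches across a word boundary in the text, so the
highlight covers the tail of one word and the head of the next. Relevance
ranking can tolerate such matches, but a native reader sees the highlight
as wrong.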
On 11/27/07 10:46 PM, "Otis Gospodnetic" <otis_gospodnetic@yahoo.com>
wrote:
> For what it's worth I worked on indexing and searching a *massive* pile of
> data, a good portion of which was in CJ and some K. The n-gram approach was
> used for all 3 languages and the quality of search results, including
> highlighting was evaluated and okay-ed by native speakers of these languages.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Walter Underwood <wunderwood@netflix.com>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, November 27, 2007 2:41:38 PM
> Subject: Re: CJK Analyzers for Solr
>
> Dictionaries are surprisingly expensive to build and maintain and
> bi-gram is surprisingly effective for Chinese. See this paper:
>
> http://citeseer.ist.psu.edu/kwok97comparing.html
>
> I expect that n-gram indexing would be less effective for Japanese
> because it is an inflected language. Korean is even harder. It might
> work to break Korean into the phonetic subparts and use n-gram on
> those.
>
> You should not do term highlighting with any of the n-gram methods.
> The relevance can be very good, but the highlighting just looks dumb.
>
> wunder
>
> On 11/27/07 8:54 AM, "Eswar K" <kja.eswar@gmail.com> wrote:
>
>> Is there any specific reason why the CJK analyzers in Solr were chosen to be
>> n-gram based instead of morphological analyzers, which is more or less what
>> Google implements, as that approach is considered more effective than the
>> n-gram one?
>>
>> Regards,
>> Eswar
>>
>>
>>
>> On Nov 27, 2007 7:57 AM, Eswar K <kja.eswar@gmail.com> wrote:
>>
>>> thanks james...
>>>
>>> How much time does it take to index 18m docs?
>>>
>>> - Eswar
>>>
>>>
>>> On Nov 27, 2007 7:43 AM, James liu <liuping.james@gmail.com> wrote:
>>>
>>>> I do not use the Hylanda analyzer.
>>>>
>>>> I use je-analyzer and have indexed at least 18m docs.
>>>>
>>>> Sorry, I only use a Chinese analyzer.
>>>>
>>>>
>>>> On Nov 27, 2007 10:01 AM, Eswar K <kja.eswar@gmail.com> wrote:
>>>>
>>>>> What is the performance of these CJK analyzers (the one in Lucene and
>>>>> Hylanda)?
>>>>> We would potentially be indexing millions of documents.
>>>>>
>>>>> James,
>>>>>
>>>>> We will have a look at Hylanda too. What about Japanese and Korean
>>>>> analyzers, any recommendations?
>>>>>
>>>>> - Eswar
>>>>>
>>>>> On Nov 27, 2007 7:21 AM, James liu <liuping.james@gmail.com> wrote:
>>>>>
>>>>>> I don't think n-gram is a good method for Chinese.
>>>>>>
>>>>>> Lucene's CJKAnalyzer is 2-gram (bigram).
>>>>>>
>>>>>> Eswar K:
>>>>>> If you need a Chinese analyzer, I recommend Hylanda (www.hylanda.com).
>>>>>> It is the best Chinese analyzer, but it is not free.
>>>>>> If you want a free Chinese analyzer, you can try je-analyzer, though it
>>>>>> has some problems in use.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Eswar,
>>>>>>>
>>>>>>> We've used the NGram stuff that exists in Lucene's contrib/analyzers
>>>>>>> instead of CJK. Doesn't that allow you to do everything that the
>>>>>>> Chinese and CJK analyzers do? It's been a few months since I've looked
>>>>>>> at the Chinese and CJK analyzers, so I could be off.
>>>>>>>
>>>>>>> Otis
>>>>>>>
>>>>>>> --
>>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>>>
>>>>>>> ----- Original Message ----
>>>>>>> From: Eswar K <kja.eswar@gmail.com>
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
>>>>>>> Subject: CJK Analyzers for Solr
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Does Solr come with language analyzers for CJK? If not, can you please
>>>>>>> direct me to some good CJK analyzers?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Eswar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> regards
>>>>>> jl
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> regards
>>>> jl
>>>>
>>>
>>>