>Wunder - are you aware of any free dictionaries
>for either C or J or K? When I dealt with this
>in the past, I looked for something free, but
>found only commercial dictionaries.
I would use data files from:
http://ftp.monash.edu.au/pub/nihongo/00INDEX.html
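
As a rough sketch of what a dictionary-based segmenter built on such data might look like: the snippet below assumes an EDICT-style layout in which the headword is the first whitespace-delimited field of each line. The sample entries and the names load_headwords, segment, and bigrams are hypothetical, made up for illustration, and are not part of Lucene, Solr, or the Monash files. It uses greedy longest-match, which is only one of several possible segmentation strategies, and prints character bigrams alongside for comparison with the n-gram approach discussed in the thread.

```python
# Illustrative entries in an assumed EDICT-like layout: HEADWORD [READING] /gloss/
# These lines are made up for the example, not copied from the real files.
SAMPLE = [
    "日本 [にほん] /Japan/",
    "日本語 [にほんご] /Japanese language/",
    "辞書 [じしょ] /dictionary/",
]


def load_headwords(lines):
    """Collect headwords, assuming each is the first whitespace-delimited field."""
    words = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        words.add(line.split()[0])
    return words


def segment(text, words):
    """Greedy forward longest-match segmentation.

    Characters not covered by any dictionary entry are emitted one at a time.
    """
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in words:
                tokens.append(candidate)
                i += size
                break
    return tokens


def bigrams(text):
    """Overlapping character bigrams, shown only for comparison."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    words = load_headwords(SAMPLE)
    print(segment("日本語の辞書", words))
    print(bigrams("日本語の辞書"))
```

With a real dictionary file you would pass the decoded lines of the file to load_headwords instead of SAMPLE; check the file's encoding and layout before relying on the first-field assumption.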
-- Ken
>Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>----- Original Message ----
>From: Walter Underwood <wunderwood@netflix.com>
>To: solr-user@lucene.apache.org
>Sent: Wednesday, November 28, 2007 5:43:32 PM
>Subject: Re: CJK Analyzers for Solr
>
>With Ultraseek, we switched to a dictionary-based segmenter for Chinese
>because the N-gram highlighting wasn't acceptable to our Chinese customers.
>I guess it is something to check for each application.
>
>wunder
>
>On 11/27/07 10:46 PM, "Otis Gospodnetic" <otis_gospodnetic@yahoo.com> wrote:
>
>> For what it's worth I worked on indexing and searching a *massive* pile of
>> data, a good portion of which was in CJ and some K. The n-gram approach was
>> used for all 3 languages and the quality of search results, including
>> highlighting was evaluated and okay-ed by native speakers of these languages.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>> From: Walter Underwood <wunderwood@netflix.com>
>> To: solr-user@lucene.apache.org
>> Sent: Tuesday, November 27, 2007 2:41:38 PM
>> Subject: Re: CJK Analyzers for Solr
>>
>> Dictionaries are surprisingly expensive to build and maintain and
>> bi-gram is surprisingly effective for Chinese. See this paper:
>>
>> http://citeseer.ist.psu.edu/kwok97comparing.html
>>
>> I expect that n-gram indexing would be less effective for Japanese
>> because it is an inflected language. Korean is even harder. It might
>> work to break Korean into the phonetic subparts and use n-gram on
>> those.
>>
>> You should not do term highlighting with any of the n-gram methods.
>> The relevance can be very good, but the highlighting just looks dumb.
>>
>> wunder
>>
>> On 11/27/07 8:54 AM, "Eswar K" <kja.eswar@gmail.com> wrote:
>>
>>> Is there any specific reason why the CJK analyzers in Solr were chosen to be
>>> n-gram based instead of it being a morphological analyzer which is kind of
>>> implemented in Google as it considered to be more effective than the n-gram
>>> ones?
>>>
>>> Regards,
>>> Eswar
>>>
>>> On Nov 27, 2007 7:57 AM, Eswar K <kja.eswar@gmail.com> wrote:
>>>
>>>> thanks james...
>>>>
>>>> How much time does it take to index 18m docs?
>>>>
>>>> - Eswar
>>>>
>>>> On Nov 27, 2007 7:43 AM, James liu <liuping.james@gmail.com> wrote:
>>>>
>>>>> i not use HYLANDA analyzer.
>>>>>
>>>>> i use je-analyzer and indexing at least 18m docs.
>>>>>
>>>>> i m sorry i only use chinese analyzer.
>>>>>
>>>>> On Nov 27, 2007 10:01 AM, Eswar K <kja.eswar@gmail.com> wrote:
>>>>>
>>>>>> What is the performance of these CJK analyzers (one in lucene and hylanda)?
>>>>>> We would potentially be indexing millions of documents.
>>>>>>
>>>>>> James,
>>>>>>
>>>>>> We would have a look at hylanda too. What abt japanese and korean
>>>>>> analyzers, any recommendations?
>>>>>>
>>>>>> - Eswar
>>>>>>
>>>>>> On Nov 27, 2007 7:21 AM, James liu <liuping.james@gmail.com> wrote:
>>>>>>
>>>>>>> I don't think NGram is good method for Chinese.
>>>>>>>
>>>>>>> CJKAnalyzer of Lucene is 2-Gram.
>>>>>>>
>>>>>>> Eswar K:
>>>>>>> if it is chinese analyzer,,i recommend hylanda (www.hylanda.com),,,it is
>>>>>>> the best chinese analyzer and it not free.
>>>>>>> if u wanna free chinese analyzer, maybe u can try je-analyzer. it have
>>>>>>> some problem when using it.
>>>>>>>
>>>>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Eswar,
>>>>>>>>
>>>>>>>> We've uses the NGram stuff that exists in Lucene's contrib/analyzers
>>>>>>>> instead of CJK. Doesn't that allow you to do everything that the Chinese
>>>>>>>> and CJK analyzers do? It's been a few months since I've looked at Chinese
>>>>>>>> and CJK Analzyers, so I could be off.
>>>>>>>>
>>>>>>>> Otis
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>>>>
>>>>>>>> ----- Original Message ----
>>>>>>>> From: Eswar K <kja.eswar@gmail.com>
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
>>>>>>>> Subject: CJK Analyzers for Solr
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Does Solr come with Language analyzers for CJK? If not, can you please
>>>>>>>> direct me to some good CJK analyzers?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eswar
>>>>>>>
>>>>>>> --
>>>>>>> regards
>>>>>>> jl
>>>>>
>>>>> --
>>>>> regards
>>>>> jl
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"