Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <702274339.1242400611512.JavaMail.jira@brutus>
Date: Fri, 15 May 2009 08:16:51 -0700 (PDT)
From: "Xiaoping Gao (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for
 Chinese
In-Reply-To: <445685380.1241412690530.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709867#action_12709867 ]

Xiaoping Gao commented on LUCENE-1629:
--------------------------------------

Hello Mingfai!

coredict.mem is converted from coredict.dct, which comes from ICTCLAS1.0,
not from the 2008 or 2009 releases.
The author authorized me to release only the lexical dictionary from
ICTCLAS1.0 under APLv2; he did not authorize the dictionaries of
ictclas2008~2009.
As far as I know, coredict.dct contains only GB2312 characters, so it cannot
support Big5.

I think we should find a suitable Big5 dictionary first; then I will help
you convert it to a dct file.
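
To make the GB2312 versus Big5 point concrete, here is a minimal sketch that asks the JDK's charset encoders whether a string can be represented in each encoding. The class name CharsetCoverage and the sample strings are illustrative assumptions, not part of the analyzer or its dictionary tooling, and the sketch assumes the JDK in use ships the "GB2312" and "Big5" charsets.

```java
import java.nio.charset.Charset;

public class CharsetCoverage {

    // Returns true if every character of text can be encoded in the named charset.
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String simplified = "中国人";   // simplified form, as found in GB2312 text
        String traditional = "中國人";  // traditional form, as found in Big5 text

        System.out.println("GB2312 / simplified:  " + canEncode("GB2312", simplified));
        System.out.println("GB2312 / traditional: " + canEncode("GB2312", traditional));
        System.out.println("Big5   / traditional: " + canEncode("Big5", traditional));
    }
}
```

A dictionary whose entries are limited to GB2312 characters cannot contain words written with traditional characters such as "國", which is why a separate Big5 (traditional) word list would be needed.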


On May 15, 2009 6:20pm, "Mingfai Ma (JIRA)" <jira@apache.org> wrote:


> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
>                 Key: LUCENE-1629
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1629
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>         Environment: for java 1.5 or higher, lucene 2.4.1
>            Reporter: Xiaoping Gao
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: analysis-data.zip, bigramdict.mem, build-resources-with-folder.patch, build-resources.patch, build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in Chinese. It is called "imdict-chinese-analyzer"; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I) "是" (am) "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must segment each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be seriously affected!
> Although there are two analyzer packages in the Apache repository that can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word. This is obviously not how the language works, and the strategy also increases the index size and hurts performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. The tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while that of the other analyzers is about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.
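
The contrast drawn in the quoted description can be sketched in plain Java without any Lucene dependency. The sketch below splits the sample sentence "我是中国人" into single characters and overlapping bigrams, and then runs a dictionary-driven segmenter that picks the split with the highest product of word probabilities. The class name SegmentationSketch, the toy dictionary, and its probabilities are all made up for illustration; the segmenter is a simple unigram dynamic-programming search, not the HHMM model of ICTCLAS or imdict-chinese-analyzer, and the n-gram helper only approximates the per-character and two-character strategies described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SegmentationSketch {

    // Hypothetical toy dictionary: word -> made-up unigram probability.
    static final Map<String, Double> DICT = new HashMap<>();
    static {
        DICT.put("我", 0.05);
        DICT.put("是", 0.05);
        DICT.put("中", 0.01);
        DICT.put("国", 0.005);
        DICT.put("人", 0.02);
        DICT.put("中国", 0.02);
        DICT.put("国人", 0.001);
        DICT.put("中国人", 0.01);
    }

    // Overlapping n-grams: n = 1 gives one token per character,
    // n = 2 gives every pair of adjoining characters.
    // Assumes all characters are in the Basic Multilingual Plane.
    static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    // Dictionary-driven segmentation: dynamic programming over split points,
    // maximizing the sum of log-probabilities of the chosen dictionary words.
    static List<String> segment(String text) {
        int len = text.length();
        double[] best = new double[len + 1]; // best score for text[0, i)
        int[] prev = new int[len + 1];       // start index of the last word in that best split
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;

        for (int end = 1; end <= len; end++) {
            for (int start = 0; start < end; start++) {
                Double p = DICT.get(text.substring(start, end));
                if (p == null || best[start] == Double.NEGATIVE_INFINITY) {
                    continue; // not a dictionary word, or prefix cannot be segmented
                }
                double score = best[start] + Math.log(p);
                if (score > best[end]) {
                    best[end] = score;
                    prev[end] = start;
                }
            }
        }
        if (best[len] == Double.NEGATIVE_INFINITY) {
            throw new IllegalArgumentException("Text cannot be segmented with the toy dictionary");
        }

        // Walk the back-pointers from the end to recover the words in order.
        List<String> words = new ArrayList<>();
        for (int end = len; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        String sentence = "我是中国人";
        System.out.println(ngrams(sentence, 1)); // [我, 是, 中, 国, 人]
        System.out.println(ngrams(sentence, 2)); // [我是, 是中, 中国, 国人]
        System.out.println(segment(sentence));   // [我, 是, 中国人]
    }
}
```

With these made-up probabilities the dictionary-driven split yields three tokens that match real words, while the n-gram strategies emit tokens such as "是中" that are not words at all.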

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org