lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mingfai Ma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
Date Sat, 16 May 2009 22:55:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710147#action_12710147
] 

Mingfai Ma commented on LUCENE-1629:
------------------------------------

i'm not sure if the character mapping is a feasible approach. The CC-CEDICT dictionary has
terms in both simplified and traditional chinese already and we need not to define any rule.
how much effort is needed to define all mapping rules for the simplified and traditional chinese
thesaurus?

Besides, simplified and traditional Chinese conversion is not as simple as mapping the code.
For the sample meaning, it may use different words in the SC and TC. 

if the approach accepted in this issue is ok, I just need to figure out how to convert CC-CEDICT
to the dct / mem format, and i suppose it is doable.

> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
>                 Key: LUCENE-1629
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1629
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>         Environment: for java 1.5 or higher, lucene 2.4.1
>            Reporter: Xiaoping Gao
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: analysis-data.zip, bigramdict.mem, build-resources-with-folder.patch,
build-resources.patch, build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch,
LUCENE-1629-java1.4.patch
>
>
> I wrote a Analyzer for apache lucene for analyzing sentences in Chinese language. it's
called "imdict-chinese-analyzer", the project on google code is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I)   "是"(am)
  "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence
properly, or there will be mis-understandings everywhere in the index constructed by Lucene,
and the accuracy of the search engine will be affected seriously!
> Although there are two analyzer packages in apache repository which can handle Chinese:
ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters
as a single word, this is obviously not true in reality, also this strategy will increase
the index size and hurt the performance baddly.
> The algorithm of imdict-chinese-analyzer is based on Hidden Markov Model (HMM), so it
can tokenize chinese sentence in a really intelligent way. Tokenizaion accuracy of this model
is above 90% according to the paper "HHMM-based Chinese Lexical analyzer ICTCLAL" while other
analyzer's is about 60%.
> As imdict-chinese-analyzer is a really fast and intelligent. I want to contribute it
to the apache lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message