lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese
Date Fri, 08 May 2009 02:42:24 GMT
I'd prefer it to stay 1.4 for now and would be willing to make the  
change, if needed.

-- DM

On May 7, 2009, at 3:04 PM, Michael McCandless (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707042#action_12707042 ]
>
> Michael McCandless commented on LUCENE-1629:
> --------------------------------------------
>
> bq. There is lots of code depending on Java 1.5; I use enums and
> generics frequently, because I saw these points on the Apache wiki:
>
> Well... "in general" contrib packages can be 1.5, but the analyzers
> contrib package is widely used and is not 1.5 now, so it's a biggish
> change to force it to 1.5 with this.  We should at least discuss
> separately on java-dev whether we want to allow 1.5 code into
> contrib-analyzers.
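
To make concrete what "1.5 code" means here: a minimal hypothetical
snippet (not taken from the patch) using the constructs in question,
which a 1.4 compiler rejects:

import java.util.ArrayList;
import java.util.List;

public class Java15Features {
    // 'enum' is a new keyword in Java 1.5; this is not valid 1.4 source.
    enum CharType { HANZI, LETTER, DIGIT, DELIMITER }

    public static void main(String[] args) {
        // Generic type parameters are also new in 1.5; under 1.4 this
        // would have to be a raw List plus explicit casts.
        List<CharType> types = new ArrayList<CharType>();
        types.add(CharType.HANZI);
        for (CharType t : types) { // enhanced for loop, also a 1.5 feature
            System.out.println(t);
        }
    }
}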
>
> We could hold off on committing this until 3.0?
>
>> contrib intelligent Analyzer for Chinese
>> ----------------------------------------
>>
>>                Key: LUCENE-1629
>>                URL: https://issues.apache.org/jira/browse/LUCENE-1629
>>            Project: Lucene - Java
>>         Issue Type: Improvement
>>         Components: contrib/analyzers
>>   Affects Versions: 2.4.1
>>        Environment: for Java 1.5 or higher, Lucene 2.4.1
>>           Reporter: Xiaoping Gao
>>        Attachments: analysis-data.zip, LUCENE-1629.patch
>>
>>
>> I wrote an Analyzer for Apache Lucene for analyzing sentences in
>> the Chinese language. It's called "imdict-chinese-analyzer"; the
>> project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/
>> In Chinese, "我是中国人" (I am Chinese) should be tokenized as
>> "我" (I) "是" (am) "中国人" (Chinese), not "我" "是中" "国人". So
>> the analyzer must handle each sentence properly, or there will be
>> misunderstandings everywhere in the index constructed by Lucene,
>> and the accuracy of the search engine will be seriously affected!
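
A minimal sketch of how such an analyzer would be driven through the
Lucene 2.4 TokenStream API; the class name ImdictChineseAnalyzer is
assumed here for illustration and may not match what the patch uses:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class SegmentDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical class name for the contributed analyzer.
        ImdictChineseAnalyzer analyzer = new ImdictChineseAnalyzer();
        TokenStream stream =
            analyzer.tokenStream("contents", new StringReader("我是中国人"));
        final Token reusable = new Token();
        // Expected output, per the description above: 我 / 是 / 中国人
        for (Token tok = stream.next(reusable); tok != null;
             tok = stream.next(reusable)) {
            System.out.println(tok.term());
        }
    }
}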
>> Although there are two analyzer packages in the Apache repository
>> which can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they
>> take each character or every two adjoining characters as a single
>> word. This is obviously not how real Chinese words work, and this
>> strategy also increases the index size and hurts performance badly.
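
For comparison, the two existing contrib analyzers can be run on the
same sentence; the token outputs in the comments reflect the
single-character and bigram strategies described above:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;

public class ExistingAnalyzersDemo {
    static void dump(Analyzer analyzer) throws IOException {
        TokenStream stream =
            analyzer.tokenStream("contents", new StringReader("我是中国人"));
        final Token reusable = new Token();
        for (Token tok = stream.next(reusable); tok != null;
             tok = stream.next(reusable)) {
            System.out.print(tok.term() + " ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        dump(new ChineseAnalyzer()); // one token per character: 我 是 中 国 人
        dump(new CJKAnalyzer());     // overlapping bigrams: 我是 是中 中国 国人
    }
}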
>> The algorithm of imdict-chinese-analyzer is based on the Hidden
>> Markov Model (HMM), so it can tokenize Chinese sentences in a
>> really intelligent way. The tokenization accuracy of this model is
>> above 90% according to the paper "HHMM-based Chinese Lexical
>> Analyzer ICTCLAS", while other analyzers' is about 60%.
>> As imdict-chinese-analyzer is really fast and intelligent, I want
>> to contribute it to the Apache Lucene repository.
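
To illustrate the idea behind probabilistic segmentation, here is a
deliberately simplified, dictionary-driven Viterbi-style sketch. The
word probabilities are made up, and the real HHMM model is far richer
(unknown words, part-of-speech layers, nested models); treat this as a
toy only:

import java.util.HashMap;
import java.util.Map;

// Toy segmenter using Viterbi-style dynamic programming:
// best[i] = highest log-probability of any segmentation of the
// first i characters of the input.
public class ToyViterbiSegmenter {
    public static void main(String[] args) {
        Map<String, Double> logProb = new HashMap<String, Double>();
        // Hand-made unigram log-probabilities; values are illustrative only.
        logProb.put("我", -3.0);
        logProb.put("是", -3.0);
        logProb.put("中国人", -6.0);
        logProb.put("中国", -5.0);
        logProb.put("人", -4.0);

        String text = "我是中国人";
        int n = text.length();
        double[] best = new double[n + 1];
        int[] backPointer = new int[n + 1];
        java.util.Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;

        // For every prefix length, try every dictionary word that could
        // end there and keep the most probable segmentation.
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Double lp = logProb.get(text.substring(start, end));
                if (lp != null && best[start] + lp > best[end]) {
                    best[end] = best[start] + lp;
                    backPointer[end] = start;
                }
            }
        }

        // Recover the best path: 我 / 是 / 中国人 (-12.0) beats
        // 我 / 是 / 中国 / 人 (-15.0) under these toy probabilities.
        java.util.LinkedList<String> words = new java.util.LinkedList<String>();
        for (int end = n; end > 0; end = backPointer[end]) {
            words.addFirst(text.substring(backPointer[end], end));
        }
        System.out.println(words); // prints [我, 是, 中国人]
    }
}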
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>



