lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
Date Thu, 14 May 2009 07:06:45 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated LUCENE-1629:
----------------------------------

    Attachment: build-resources.patch

Here another try with Erik's suggestion:
I moved the <copy> task to the compile macro and extended the list of exclusions. With
some work and verbose=true, I added all "source" files to the exclusion (also .jj and so on).

Using this patch, you can compile Xiaoping Gao patch, add the resources to cn/ and cn/smart/hhmm/
and they appear in classpath for testing and the final jar file.

My problem with this is the messy exclusion list. During reading ANT docs, I dound out that
there is the possibility with the <copy> task to not stop on errors. The idea is now
again to put the data files into a maven-like resources folder and just copy them to the classpath
(if the folder does not exist, copy would simply do nothing).

I post a patch/test later.

> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
>                 Key: LUCENE-1629
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1629
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>         Environment: for java 1.5 or higher, lucene 2.4.1
>            Reporter: Xiaoping Gao
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: analysis-data.zip, bigramdict.mem, build-resources.patch, build-resources.patch,
coredict.mem, LUCENE-1629-java1.4.patch
>
>
> I wrote a Analyzer for apache lucene for analyzing sentences in Chinese language. it's
called "imdict-chinese-analyzer", the project on google code is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I)   "是"(am)
  "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence
properly, or there will be mis-understandings everywhere in the index constructed by Lucene,
and the accuracy of the search engine will be affected seriously!
> Although there are two analyzer packages in apache repository which can handle Chinese:
ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters
as a single word, this is obviously not true in reality, also this strategy will increase
the index size and hurt the performance baddly.
> The algorithm of imdict-chinese-analyzer is based on Hidden Markov Model (HMM), so it
can tokenize chinese sentence in a really intelligent way. Tokenizaion accuracy of this model
is above 90% according to the paper "HHMM-based Chinese Lexical analyzer ICTCLAL" while other
analyzer's is about 60%.
> As imdict-chinese-analyzer is a really fast and intelligent. I want to contribute it
to the apache lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message