lucene-dev mailing list archives

From "Cheolgoo Kang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-461) StandardTokenizer splitting all of Korean words into separate characters
Date Tue, 08 Nov 2005 06:59:19 GMT
     [ http://issues.apache.org/jira/browse/LUCENE-461?page=all ]

Cheolgoo Kang updated LUCENE-461:
---------------------------------

    Attachment: StandardTokenizer_KoreanWord.patch
                TestStandardAnalyzer_KoreanWord.patch

Attached are patches that keep each Korean word as a single token instead of splitting it into individual characters. The attached TestStandardAnalyzer test case passes against StandardTokenizer with the patch applied.
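To illustrate the idea (this is a self-contained sketch, not the attached patch — the class and method names are made up for this example): instead of emitting one token per character, a tokenizer can treat a run of consecutive Hangul syllables (Unicode block U+AC00..U+D7A3) as one word token.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the LUCENE-461 fix: group consecutive Hangul
// syllables into one token rather than one token per character.
// KoreanWordSketch and tokenize() are hypothetical names, not Lucene API.
public class KoreanWordSketch {

    // Hangul Syllables block: U+AC00..U+D7A3
    static boolean isHangul(char c) {
        return c >= '\uAC00' && c <= '\uD7A3';
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) { i++; continue; }
            int start = i;
            if (isHangul(text.charAt(i))) {
                // keep a run of Hangul syllables together as a single word
                while (i < text.length() && isHangul(text.charAt(i))) i++;
            } else {
                while (i < text.length()
                        && !Character.isWhitespace(text.charAt(i))
                        && !isHangul(text.charAt(i))) i++;
            }
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // One Korean word stays one token, instead of five.
        System.out.println(tokenize("안녕하세요 lucene"));
        // → [안녕하세요, lucene]
    }
}
```

The unpatched behavior described in this issue would instead produce five single-character tokens for the Korean word.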

> StandardTokenizer splitting all of Korean words into separate characters
> ------------------------------------------------------------------------
>
>          Key: LUCENE-461
>          URL: http://issues.apache.org/jira/browse/LUCENE-461
>      Project: Lucene - Java
>         Type: Bug
>   Components: Analysis
>  Environment: Analyzing Korean text with Apache Lucene, esp. with StandardAnalyzer.
>     Reporter: Cheolgoo Kang
>     Priority: Minor
>  Attachments: StandardTokenizer_KoreanWord.patch, TestStandardAnalyzer_KoreanWord.patch
>
> StandardTokenizer splits all those Korean words into separate character tokens. For example,
> "안녕하세요" is one Korean word that means "Hello", but StandardAnalyzer separates it into five
> tokens of "안", "녕", "하", "세", "요".

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

