lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 18933] - Add support for Chinese, Japanese, and Korean to the core build.
Date Mon, 29 Sep 2003 06:57:49 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18933>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18933

Add support for Chinese, Japanese, and Korean to the core build.





------- Additional Comments From jshin@jtan.com  2003-09-29 06:57 -------
I came across this bug purely by chance, and until a moment ago I did not know
about Lucene, so some or all of the following may not be relevant; I apologize
in advance if so. To read this comment, set your browser's character encoding
to UTF-8, because it includes some Korean characters encoded in UTF-8.

Korean is NOT like Chinese and Japanese. (Modern) Korean texts do use spaces
between words. However, the Korean orthographic standard is rather 'liberal':
although the norm is to put spaces between nouns, it *allows* multiple _nouns_
to be written together without spaces when they refer to a single
'entity'/'concept'. Therefore, Korean texts are full of 'megawords' a la German
compound words. For instance, in German, 'quantum mechanics' is
'Quantenmechanik'. In Korean, it's either '양자 역학' (the norm, with a space,
English-like) or '양자역학' (more widely used, German-like).
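
To illustrate why such compounds matter to a tokenizer, here is a minimal
sketch of splitting a space-less compound against a word list. Everything in
it is an assumption made for illustration: the class name NaiveDecompounder,
the two-entry lexicon, and the greedy longest-match strategy are hypothetical
and are not part of Lucene or of any existing analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not a Lucene class.
class NaiveDecompounder {
    // Toy lexicon (an assumption): a real system would need a large noun dictionary.
    private static final Set<String> LEXICON =
            new HashSet<>(Arrays.asList("양자", "역학"));

    // Greedy longest-match split without backtracking.
    // Returns the token unchanged (as a one-element list) if the lexicon
    // cannot cover it completely.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < token.length()) {
            int end = token.length();
            // Shrink the candidate until it is a known word or becomes empty.
            while (end > pos && !LEXICON.contains(token.substring(pos, end))) {
                end--;
            }
            if (end == pos) {
                // No lexicon entry starts at this position: give up on splitting.
                List<String> whole = new ArrayList<>();
                whole.add(token);
                return whole;
            }
            parts.add(token.substring(pos, end));
            pos = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("양자역학"));
    }
}
```

With a split like this, a search for '역학' could also match documents that
only contain the joined form '양자역학'. A greedy match can pick the wrong
boundary when several segmentations are possible, so this is only a starting
point.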

The following comment may be off-topic here.
What's more relevant to a Korean tokenizer (and to a Japanese tokenizer as well,
because both languages are agglutinative, whereas Chinese is an isolating
language) is the ability to split word stems apart from the prefixes/suffixes
that play various grammatical roles (tense, honorific form, mood, and so forth)
and from particles (denoting subject, object, etc.). In many applications,
grammatically functional prefixes/suffixes/particles/words have to be excluded
from indexing because they are not 'content-bearing'. Basis Technology's Korean
analyzer (www.basistech.com) is quite good (not perfect) at this.
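
As a rough sketch of the idea of dropping particles before indexing, the
following strips a trailing particle from each whitespace-separated token. It
is an assumption-laden toy: the class name NaiveParticleStripper and its short
particle list are hypothetical, input is assumed to be precomposed (NFC)
Hangul, and plain suffix matching is standing in for the morphological
analysis a real analyzer would perform.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not a Lucene class.
class NaiveParticleStripper {
    // A deliberately small particle list (an assumption), with the
    // two-syllable forms first so that "으로" is tried before "로".
    private static final String[] PARTICLES = {
        "에서", "으로", "은", "는", "이", "가", "을", "를", "의", "에", "로"
    };

    // Removes one trailing particle, if any, but never empties the token.
    static String stripParticle(String token) {
        for (String p : PARTICLES) {
            if (token.length() > p.length() && token.endsWith(p)) {
                return token.substring(0, token.length() - p.length());
            }
        }
        return token;
    }

    // Splits on whitespace and strips a particle from each token.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                out.add(stripParticle(tok));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("학교에서 책을 읽는다"));
    }
}
```

Blind suffix matching will wrongly trim any word whose stem happens to end in
one of these syllables, and it does nothing about verb endings, which is why
a dictionary-backed analyzer is needed for real use.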
