lucene-dev mailing list archives

From "Che Dong" <ched...@hotmail.com>
Subject [contrib]: CJKTokenizer for Asian-language (Chinese, Japanese, Korean) word segmentation
Date Mon, 13 May 2002 14:55:21 GMT

/**
 * CJKTokenizer was adapted from StopTokenizer, which does a decent job for
 * most European languages. For double-byte characters it uses a different
 * tokenization method: tokens are emitted as overlapping pairs of adjacent
 * characters (overlap match).
 * Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4"
 * A downstream filter is also needed to remove zero-length tokens ("").
 *
 * For more information on Asian-language (Chinese, Japanese, Korean) text
 * segmentation, see:
 * http://www.google.com/search?q=overlap+match+chinese+segment
 * Digits: a digit prefix is split into its own token: "3dmax" => "3" "dmax"; "U2" => "u2"
 * Punctuation: '_' is tokenized as a letter; '+' and '#' are tokenized as digits.
 *
 * @author    Che, Dong chedong@bigfoot.com
 * @version   $Id$
*/

 CJKTokenizer.java
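
Below is a minimal, self-contained sketch of the overlap-match (two-character
bigram) idea described in the comment above. It is not the attached
CJKTokenizer and does not use Lucene's Tokenizer API; the class and method
names (OverlapMatchSketch, segment, flushCjk, ...) are illustrative only, and
the CJK check is deliberately crude.

import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: runs of double-byte (CJK) characters become overlapping
 * two-character tokens, while single-byte letter/digit runs are kept whole
 * and lowercased. Not the attached CJKTokenizer.
 */
public class OverlapMatchSketch {

    // crude assumption: treat anything above the Latin-1 range as double-byte
    static boolean isCjk(char c) {
        return c > 0xFF;
    }

    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder latin = new StringBuilder();
        StringBuilder cjk = new StringBuilder();

        for (int i = 0; i <= text.length(); i++) {
            // sentinel space at the end forces a final flush
            char c = (i < text.length()) ? text.charAt(i) : ' ';
            if (isCjk(c)) {
                flushLatin(latin, tokens);
                cjk.append(c);
            } else if (Character.isLetterOrDigit(c)) {
                flushCjk(cjk, tokens);
                latin.append(Character.toLowerCase(c));
            } else {
                // separator or punctuation: flush both pending runs
                flushLatin(latin, tokens);
                flushCjk(cjk, tokens);
            }
        }
        return tokens;
    }

    static void flushLatin(StringBuilder buf, List<String> tokens) {
        if (buf.length() > 0) {
            tokens.add(buf.toString());
            buf.setLength(0);
        }
    }

    static void flushCjk(StringBuilder buf, List<String> tokens) {
        // emit overlapping bigrams: C1C2, C2C3, C3C4 ...
        for (int i = 0; i + 1 < buf.length(); i++) {
            tokens.add(buf.substring(i, i + 2));
        }
        if (buf.length() == 1) {
            tokens.add(buf.toString()); // a lone CJK character stands alone
        }
        buf.setLength(0);
    }

    public static void main(String[] args) {
        // prints [java, 中文, 文分, 分词] -- the "java C1C2C3C4" pattern above
        System.out.println(segment("java 中文分词"));
    }
}

Running it on "java 中文分词" mirrors the "java C1C2C3C4" => "java" "C1C2"
"C2C3" "C3C4" example; the digit-prefix and punctuation rules from the
comment are intentionally not reproduced in this sketch.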
