lucene-dev mailing list archives

From "Junshik, Jeon" <lo...@nextel.co.kr>
Subject Korean character set in analysis
Date Thu, 29 Nov 2001 05:15:45 GMT
Hello,

I've been testing Lucene indexing and searching on Korean-language documents,
but the Korean character set is currently not supported.

So I've changed some code to make it work with the Korean character set.


In the "jakarta-lucene\src\java\org\apache\lucene\analysis\standard\StandardTokenizer.jj" file:

JavaCC options section:
-----------------------------------------------------------------
options {
  STATIC = false;
//IGNORE_CASE = true;
//BUILD_PARSER = false;
  UNICODE_INPUT = true; // <== changed: uncommented for the Korean character set
  USER_CHAR_STREAM = true;
  OPTIMIZE_TOKEN_MANAGER = true;
//DEBUG_TOKEN_MANAGER = true;
}

In the TOKEN definition:
-----------------------------------------------------------------
| < #LETTER:					  // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff",
       "\u3040"-"\u318f",
       "\u3300"-"\u337f",
       "\u3400"-"\u3d2d",
       "\u4e00"-"\u9fff",
       "\uac00"-"\ud7a3",   // <== changed: added (Korean character set in Unicode)
       "\uf900"-"\ufaff"
      ]
  >
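For reference, the added range corresponds to the Unicode "Hangul Syllables" block (U+AC00 through U+D7A3). A quick self-contained sketch in plain Java (the class and method names are just for illustration) shows that Korean syllables fall inside this range:

```java
// Quick check that Hangul syllables fall in the newly added range
// U+AC00..U+D7A3 (the Unicode "Hangul Syllables" block).
public class HangulRangeCheck {
    static boolean inAddedRange(char c) {
        return c >= '\uac00' && c <= '\ud7a3';
    }

    public static void main(String[] args) {
        String korean = "\ud55c\uad6d\uc5b4"; // "Korean language" in Hangul
        for (int i = 0; i < korean.length(); i++) {
            char c = korean.charAt(i);
            System.out.println("U+" + Integer.toHexString(c)
                    + " inAddedRange=" + inAddedRange(c));
            // each line prints inAddedRange=true
        }
    }
}
```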

I hope these changes can be added to the CVS repository.


Another question is how to analyze compound words.

A compound word consists of several nouns, and I want to index every noun in
the compound word after analysis. But the current TokenStream class has only a
"public Token next()" method.

Could you let me know how to solve this?
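One common way to emit several tokens from a single next() method is to buffer the decomposed nouns: when next() encounters a compound word, split it, return the first part, and queue the rest for subsequent calls. A minimal self-contained sketch of this idea (the token type and the splitting rule below are placeholders, not Lucene's actual Token API or a real morphological analyzer):

```java
import java.util.LinkedList;

// Sketch: a stream whose next() returns the sub-nouns of a compound
// word one at a time by buffering them in an internal queue. The
// splitting rule (a "+" delimiter) stands in for a real Korean
// morphological analyzer.
public class CompoundTokenStream {
    private final String[] words;        // pre-tokenized input
    private int pos = 0;
    private final LinkedList<String> pending = new LinkedList<String>();

    public CompoundTokenStream(String[] words) {
        this.words = words;
    }

    // Returns the next token, or null when the stream is exhausted.
    public String next() {
        if (!pending.isEmpty()) {
            return pending.removeFirst(); // drain buffered sub-nouns first
        }
        if (pos >= words.length) {
            return null;
        }
        String word = words[pos++];
        String[] parts = word.split("\\+"); // placeholder compound marker
        for (int i = 1; i < parts.length; i++) {
            pending.addLast(parts[i]);      // buffer the remaining nouns
        }
        return parts[0];
    }
}
```

With input {"foo+bar", "baz"}, successive next() calls return "foo", "bar", "baz", then null; in Lucene this buffering would live in a TokenFilter wrapping the tokenizer.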

Regards,

Junshik, Jeon (locus@nextel.co.kr)