lucene-java-user mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject ICUTokenizer and CJK
Date Mon, 22 Nov 2010 23:50:55 GMT
Hi all,

The javadoc for ICUTokenizer says it has special handling for Lao, Myanmar, and Khmer
word breaking, but it gives no details about what it does with CJK text. For Chinese and
Japanese, it appears to break the text into unigrams (one token per character). Is this correct?
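
To make "unigrams" concrete, here is a minimal sketch of what one-token-per-character output would look like. It does not call ICUTokenizer and does not confirm its behavior; the class name UnigramSketch, the method unigrams, and the choice of scripts (Han, Hiragana, Katakana) are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class UnigramSketch {
    // Hypothetical helper: emits one token per code point for Han, Hiragana,
    // and Katakana characters, and skips everything else.
    // This only illustrates the term "unigram"; it is not Lucene code.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.HAN
                    || script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA) {
                tokens.add(new String(Character.toChars(cp)));
            }
        });
        return tokens;
    }

    public static void main(String[] args) {
        // Each CJK character becomes its own token.
        System.out.println(unigrams("中文分词"));
    }
}
```

Comparing output like this against the tokens the real tokenizer produces for the same input would be one way to check the observation.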


Tom

