lucene-java-user mailing list archives

From Gili Nachum <gilinac...@gmail.com>
Subject Lightweight detection of whether a keyword is CJK or not (language detection)
Date Thu, 21 Feb 2013 22:51:09 GMT
Hello,

Is there anything in Lucene core or contrib that could help detect
whether a keyword is CJK?
I was thinking that a reasonable heuristic would be to check whether the
Unicode code points of the keyword's characters fall within the CJK ranges.
Is there anything that does that?
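
For illustration, here is a minimal sketch of that heuristic using only the JDK (Java 7+), not a Lucene API. The class name `CjkDetector` and its methods are hypothetical, and treating Han, Hiragana, Katakana, Hangul, and Bopomofo as "CJK" is an assumption about which scripts should count.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper; not part of Lucene.
final class CjkDetector {
    private CjkDetector() {}

    /** True if the code point belongs to a script assumed here to be "CJK". */
    static boolean isCjkCodePoint(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA
            || script == UnicodeScript.HANGUL
            || script == UnicodeScript.BOPOMOFO;
    }

    /** True if at least one code point in the keyword is CJK. */
    static boolean containsCjk(String keyword) {
        // Iterate by code point so supplementary-plane ideographs
        // (encoded as surrogate pairs) are classified correctly.
        for (int i = 0; i < keyword.length(); ) {
            int cp = keyword.codePointAt(i);
            if (isCjkCodePoint(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

Checking the Unicode script rather than hard-coded ranges avoids having to maintain a list of CJK blocks by hand, at the cost of depending on the Unicode tables shipped with the JDK.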

I'm seeing really bad performance when users query for keywords with a
wildcard (e.g., "abc*"). Therefore, as a defensive measure, I plan to
require at least 4 characters before the wildcard (e.g., reject "abc*",
allow "abcd*").
However, for CJK keywords I would like to make an exception, since in CJK
just one or two characters can form a distinct word (I'm okay with the fact
that some CJK characters are phonetic in nature rather than words).
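
The rule could be sketched roughly as follows. This is an assumption-laden illustration, not Lucene code: the class `WildcardPrefixPolicy`, the method `isAllowed`, and the choice to exempt any term containing at least one CJK character are all hypothetical, and only a single trailing `*` is handled.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical policy class; not part of Lucene.
final class WildcardPrefixPolicy {
    /** Minimum number of characters required before the trailing '*'. */
    static final int MIN_PREFIX_LENGTH = 4;

    // Scripts assumed to count as CJK for the exception.
    private static final Set<UnicodeScript> CJK_SCRIPTS = EnumSet.of(
        UnicodeScript.HAN, UnicodeScript.HIRAGANA,
        UnicodeScript.KATAKANA, UnicodeScript.HANGUL);

    private WildcardPrefixPolicy() {}

    /** Decides whether a term such as "abcd*" may be run as a wildcard query. */
    static boolean isAllowed(String term) {
        // Strip a single trailing '*' to get the literal prefix.
        String prefix = term.endsWith("*")
            ? term.substring(0, term.length() - 1)
            : term;
        if (prefix.isEmpty()) {
            return false; // a bare "*" is always rejected
        }
        // Count code points, not chars, so surrogate pairs count once.
        if (prefix.codePointCount(0, prefix.length()) >= MIN_PREFIX_LENGTH) {
            return true;
        }
        // Short prefix: allow it only if it contains a CJK character.
        for (int i = 0; i < prefix.length(); ) {
            int cp = prefix.codePointAt(i);
            if (CJK_SCRIPTS.contains(UnicodeScript.of(cp))) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

With this sketch, "abc*" is rejected, "abcd*" is allowed, and a one-character CJK prefix such as "中*" is allowed. Requiring every character to be CJK, rather than any one of them, would be a stricter variant of the same idea.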

Thanks.
Gili.
