lucene-java-user mailing list archives

From Gili Nachum <gilinac...@gmail.com>
Subject Re: Lightweight detection of whether a keyword is CJK or not (language detection)
Date Sun, 10 Mar 2013 09:19:11 GMT
Answering my own question, for the sake of future readers.
Comparing the result of Character.UnicodeBlock.of(...) against
Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS does the job.

Example:

import org.junit.Assert;
import org.junit.Test;

public class DetectCJK {

    @Test
    public void test1() {
        Assert.assertEquals(Character.UnicodeBlock.BASIC_LATIN,
                Character.UnicodeBlock.of('a'));
        Assert.assertEquals(Character.UnicodeBlock.HEBREW,
                Character.UnicodeBlock.of('א'));
        Assert.assertEquals("Traditional Chinese: Electricity",
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                Character.UnicodeBlock.of('電'));
        Assert.assertEquals("Simplified Chinese: Electricity",
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                Character.UnicodeBlock.of('电'));
        Assert.assertEquals("Japanese: Electricity",
                Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                Character.UnicodeBlock.of('電'));

        // "Chinese characters" in traditional and simplified form, separated by '/'.
        String chineseWritingStr = "漢字/汉字";
        // Advance by code point rather than by char, so supplementary
        // characters (surrogate pairs) are read as a single unit.
        for (int i = 0; i < chineseWritingStr.length(); ) {
            int codePoint = chineseWritingStr.codePointAt(i);
            i += Character.charCount(codePoint);
            if (codePoint == '/') {
                continue; // the separator is BASIC_LATIN, not CJK
            }
            Assert.assertEquals("Chinese: Chinese writing",
                    Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                    Character.UnicodeBlock.of(codePoint));
        }
    }
}
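
For the wildcard use case in the original question, the same check could be
wrapped in a small helper. The following is only a sketch under some
assumptions: the class name CjkWildcardGuard, the methods containsCjk and
allowWildcardPrefix, and the constant MIN_PREFIX_CODE_POINTS are hypothetical
names made up for illustration and are not part of Lucene. Note that
CJK_UNIFIED_IDEOGRAPHS covers only the main block of Han ideographs; Japanese
kana and Korean Hangul live in separate blocks, so the sketch lists a few more
blocks. Which blocks to treat as "CJK" is a policy decision, and this list is
not exhaustive.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CjkWildcardGuard {

    // Assumption: the blocks treated as "CJK" here are an illustrative choice.
    private static final Set<Character.UnicodeBlock> CJK_BLOCKS =
            new HashSet<>(Arrays.asList(
                    Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                    Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
                    Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS,
                    Character.UnicodeBlock.HIRAGANA,
                    Character.UnicodeBlock.KATAKANA,
                    Character.UnicodeBlock.HANGUL_SYLLABLES));

    // Minimum prefix length for non-CJK keywords, taken from the question.
    static final int MIN_PREFIX_CODE_POINTS = 4;

    /** Returns true if at least one code point falls in one of CJK_BLOCKS. */
    static boolean containsCjk(String keyword) {
        for (int i = 0; i < keyword.length(); ) {
            int codePoint = keyword.codePointAt(i);
            // UnicodeBlock.of may return null for code points outside any
            // block; HashSet.contains(null) simply returns false.
            if (CJK_BLOCKS.contains(Character.UnicodeBlock.of(codePoint))) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    /**
     * Accepts a wildcard prefix (the text before the '*') if it contains CJK
     * characters or is at least MIN_PREFIX_CODE_POINTS code points long.
     */
    static boolean allowWildcardPrefix(String prefix) {
        if (prefix.isEmpty()) {
            return false;
        }
        int length = prefix.codePointCount(0, prefix.length());
        return containsCjk(prefix) || length >= MIN_PREFIX_CODE_POINTS;
    }
}
```

With this sketch, allowWildcardPrefix("abc") rejects the query,
allowWildcardPrefix("abcd") accepts it, and a single ideograph such as "電" is
accepted regardless of length. The check is a per-character heuristic and not
real language detection.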


On Fri, Feb 22, 2013 at 12:51 AM, Gili Nachum <gilinachum@gmail.com> wrote:

> Hello, is there anything in the Lucene core/contrib that could help detect
> whether a keyword is CJK or not?
> I was thinking that an okay heuristic might be to check whether the Unicode
> values of the keyword's characters fall within the CJK ranges. Is there
> anything that does that?
>
> I'm seeing really bad performance when users query for keywords with a
> wildcard (say: "abc*"). Therefore, as a defensive measure, I plan to
> require wildcard queries to have a minimum of 4 characters (e.g., reject
> "abc*", allow "abcd*").
> However, I would like to make an exception for CJK keywords, since in CJK
> just one or two characters can stand for a distinct word (I'm okay with the
> fact that some CJK characters are not words but are phonetic in nature).
>
> Thanks.
> Gili.
>
