From: Robert Muir
Date: Tue, 23 Nov 2010 06:07:25 -0500
Subject: Re: ICUTokenizer and CJK
To: java-user@lucene.apache.org

On Mon, Nov 22, 2010 at 6:50 PM, Burton-West, Tom wrote:
> Hi all,
>
> I see in the javadoc for the ICUTokenizer that it has special handling for Lao, Myanmar, Khmer word breaking but no details in the javadoc about what it does with CJK, which for C and J appears to be breaking into unigrams. Is this correct?
>

The Han ideographs are segmented into unigrams (this is the UAX #29 default behavior). I don't know off the top of my head what the rules are for Japanese kana...
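
To make "segmented into unigrams" concrete, here is a minimal sketch in plain Java. It is not the ICUTokenizer code and does not call Lucene or ICU; the class name HanUnigramSketch and the method tokenize are hypothetical, made up for illustration. It assumes only that each Han code point becomes its own token. Grouping every other run of letters and digits into a single token is a simplification of mine, and it says nothing about how kana or other scripts are actually handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: NOT Lucene's ICUTokenizer and NOT the UAX #29 rules.
class HanUnigramSketch {

    // Emits each Han code point as a separate token (a "unigram").
    // Assumption for this sketch: any other run of letters/digits is one token,
    // and everything else (spaces, punctuation) only separates tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // advance by code point, not by char
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (han || !Character.isLetterOrDigit(cp)) {
                // A Han character or a separator ends the current non-Han run.
                if (run.length() > 0) {
                    tokens.add(run.toString());
                    run.setLength(0);
                }
                if (han) {
                    tokens.add(new String(Character.toChars(cp)));
                }
            } else {
                run.appendCodePoint(cp);
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Lucene, 中, 文, 分, 词]
        System.out.println(tokenize("Lucene 中文分词"));
    }
}
```

With unigram tokens like these, a query for a multi-character Chinese word only matches as intended if it is run as a phrase over the individual characters, which is the practical consequence of the behavior asked about.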