From: Robert Muir
Date: Thu, 17 Nov 2011 08:08:44 -0500
To: lucy-dev@incubator.apache.org
In-Reply-To: <4EC501FD.80909@aevum.de>
Subject: Re: [lucy-dev] Unicode integration

On Thu, Nov 17, 2011 at 7:45 AM, Nick Wellnhofer wrote:
> On 17/11/2011 13:37, Robert Muir wrote:
>>
>> The point of the derived property is that there are sneaky
>> interactions between these.
>
> Having a look at the utf8proc code, the function utf8proc_decompose_char
> calls itself recursively when substituting characters. So it looks like it
> does support NFKC_Casefold properly.

Yeah, the problematic ones can be seen here:
http://www.unicode.org/Public/5.0.0/ucd/DerivedNormalizationProps.txt

# Derived Property: FC_NFKC_Closure
# Generated from computing: b = NFKC(Fold(a)); c = NFKC(Fold(b));
# Then if (c != b) add the mapping from a to c to the set of
# mappings that constitute the FC_NFKC_Closure list

So from what I can tell at a glance: with the utf8proc algorithm, if
you specify NFKC and casefolding, it's not yet 'done'.

-- 
lucidimagination.com
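
To make the b/c computation from the DerivedNormalizationProps.txt comment concrete, here is a rough sketch in Python. It is not utf8proc and says nothing about what utf8proc actually does; it uses Python's own str.casefold() and unicodedata.normalize(), whose Unicode version is newer than 5.0.0, so the exact set of affected code points may differ from the file linked above. The function names are made up for illustration. U+037A is used as the sample because it appears in the FC_NFKC_Closure list.

```python
import unicodedata


def nfkc_fold_once(text: str) -> str:
    """One pass of NFKC(Fold(text)); this is 'b' in the quoted comment."""
    return unicodedata.normalize("NFKC", text.casefold())


def needs_closure(char: str) -> bool:
    """True if a second pass changes the result, i.e. c != b."""
    b = nfkc_fold_once(char)
    c = nfkc_fold_once(b)
    return c != b


def nfkc_fold_closed(text: str) -> str:
    """Repeat the pass until the output stops changing (a fixed point)."""
    current = nfkc_fold_once(text)
    while True:
        following = nfkc_fold_once(current)
        if following == current:
            return current
        current = following


if __name__ == "__main__":
    sample = "\u037a"  # GREEK YPOGEGRAMMENI
    # A single pass decomposes it to SPACE + U+0345, and U+0345 itself
    # still has a case folding, so one pass is not the final answer.
    print([hex(ord(ch)) for ch in nfkc_fold_once(sample)])
    print([hex(ord(ch)) for ch in nfkc_fold_closed(sample)])
    print(needs_closure(sample))
```

The loop in nfkc_fold_closed is the blunt way to get a stable result; an implementation that folds the output of each decomposition as it goes could reach the same result in one traversal, which is the question for utf8proc here.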