incubator-lucy-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: [lucy-dev] Unicode integration
Date Thu, 17 Nov 2011 12:57:42 GMT
On Thu, Nov 17, 2011 at 7:45 AM, Nick Wellnhofer <wellnhofer@aevum.de> wrote:
> On 17/11/2011 13:37, Robert Muir wrote:
>>
>> The point of the derived property is that there are sneaky
>> interactions between these.
>
> Having a look at the utf8proc code, the function utf8proc_decompose_char
> calls itself recursively when substituting characters. So it looks like it
> does support NFKC_Casefold properly.
>
> Nick
>

I don't think so: it seems to decompose only the 'output' of the case
folding mapping. This is not enough.

If I remember correctly, the problem is that normalization depends on
context, so the algorithm must be done as stated in the standard:

 toNFKC_Casefold(X): Map each character C in X to NFKC_Casefold(C) and then
normalize the resulting string to NFC

That is: do the per-character mappings first, then normalize the whole string.
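
A rough sketch of why the final whole-string pass matters, in Python. Note
the assumptions: fold_char() is only a hypothetical stand-in for the real
NFKC_Casefold(C) derived property (it ignores the removal of default
ignorable code points, among other things), and per_char_only() is just an
illustration of normalizing each mapping in isolation, not a description of
what utf8proc actually does.

```python
import unicodedata


def fold_char(c):
    # Hypothetical approximation of NFKC_Casefold(C) for one character:
    # compatibility-normalize, case fold, then normalize again.
    # This is NOT the real derived property from the Unicode data files.
    folded = unicodedata.normalize("NFKC", c).casefold()
    return unicodedata.normalize("NFKC", folded)


def to_nfkc_casefold_sketch(s):
    # Follows the shape of the standard's definition: map every character,
    # then normalize the resulting string as a whole to NFC.
    mapped = "".join(fold_char(c) for c in s)
    return unicodedata.normalize("NFC", mapped)


def per_char_only(s):
    # Normalizes each character's mapping separately and never looks at
    # neighbouring characters, so cross-character composition is missed.
    return "".join(unicodedata.normalize("NFC", fold_char(c)) for c in s)


if __name__ == "__main__":
    text = "A\u030a"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
    whole = to_nfkc_casefold_sketch(text)
    piecewise = per_char_only(text)
    print([hex(ord(ch)) for ch in whole])
    print([hex(ord(ch)) for ch in piecewise])
```

With the input above, the whole-string version composes the folded "a" with
the following combining mark, while the per-character version leaves them as
two separate code points.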

In ICU this is instead done as an additional normalization form, so
it's single-pass/non-recursive there.

-- 
lucidimagination.com
