lucy-dev mailing list archives

From Robert Muir <>
Subject Re: [lucy-dev] Unicode integration
Date Thu, 17 Nov 2011 12:57:42 GMT
On Thu, Nov 17, 2011 at 7:45 AM, Nick Wellnhofer <> wrote:
> On 17/11/2011 13:37, Robert Muir wrote:
>> The point of the derived property is that there are sneaky
>> interactions between these.
> Having a look at the utf8proc code, the function utf8proc_decompose_char
> calls itself recursively when substituting characters. So it looks like it
> does support NFKC_Casefold properly.
> Nick

I don't think so: it seems to decompose only the 'output' of the case
folding mapping. That is not enough.

If I remember correctly, the problem is that normalization depends on
context, so the algorithm must be done as stated in the standard:

 toNFKC_Casefold(X): Map each character C in X to NFKC_Casefold(C) and then
normalize the resulting string to NFC

That is, do the per-character mappings first, then normalize the whole string.
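
As a rough illustration of why the final whole-string pass matters, here is a minimal Python sketch. The table NFKC_CF_SAMPLE and the helper names map_chars and to_nfkc_casefold are made up for this example and cover only a handful of characters; a real implementation would need the full NFKC_Casefold data from the Unicode Character Database (DerivedNormalizationProps.txt).

```python
import unicodedata

# Hypothetical excerpt of the per-character NFKC_Casefold(C) mapping.
# Only a few sample entries are listed; everything else maps to itself here,
# which is NOT true of the real property.
NFKC_CF_SAMPLE = {
    "A": "a",
    "\u00C5": "\u00E5",  # LATIN CAPITAL LETTER A WITH RING ABOVE -> small form
    "\u212B": "\u00E5",  # ANGSTROM SIGN -> small a with ring above
    "\u00AD": "",        # SOFT HYPHEN is removed
    "\uFB01": "fi",      # LATIN SMALL LIGATURE FI -> "fi"
}


def map_chars(text, table=NFKC_CF_SAMPLE):
    """Step 1 only: map each character C to NFKC_Casefold(C)."""
    return "".join(table.get(c, c) for c in text)


def to_nfkc_casefold(text, table=NFKC_CF_SAMPLE):
    """Step 1, then step 2: normalize the whole mapped string to NFC."""
    return unicodedata.normalize("NFC", map_chars(text, table))


# "A" followed by COMBINING RING ABOVE: the per-character step cannot see
# that the two characters compose, so only the whole-string NFC pass
# produces the precomposed U+00E5.
sample = "A\u030A"
mapped_only = map_chars(sample)       # "a" + U+030A, still two code points
full_result = to_nfkc_casefold(sample)  # single code point U+00E5
```

Here mapped_only is still two code points while full_result is one, which is the context dependence mentioned above: the mapping of each character is correct on its own, but the result is only normalized once the string is processed as a whole.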

In ICU this is instead done as an additional normalization form, so
it's single-pass and non-recursive there.

