lucy-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: [lucy-dev] Unicode integration
Date Thu, 17 Nov 2011 13:08:44 GMT
On Thu, Nov 17, 2011 at 7:45 AM, Nick Wellnhofer <wellnhofer@aevum.de> wrote:
> On 17/11/2011 13:37, Robert Muir wrote:
>>
>> The point of the derived property is that there are sneaky
>> interactions between these.
>
> Having a look at the utf8proc code, the function utf8proc_decompose_char
> calls itself recursively when substituting characters. So it looks like it
> does support NFKC_Casefold properly.

Yeah, the problematic ones can be seen here:
http://www.unicode.org/Public/5.0.0/ucd/DerivedNormalizationProps.txt

# Derived Property: FC_NFKC_Closure
#  Generated from computing: b = NFKC(Fold(a)); c = NFKC(Fold(b));
#  Then if (c != b) add the mapping from a to c to the set of
#  mappings that constitute the FC_NFKC_Closure list

So from what I can tell at a glance: with the utf8proc algorithm, if
you specify NFKC and casefolding, it's not yet 'done'. For the
characters in this list, another round of Fold + NFKC would still
change the result.
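
To make the definition above concrete, here is a rough sketch in Python
rather than utf8proc. It assumes that str.casefold() is an acceptable
stand-in for Fold() and that the interpreter's own Unicode tables are
close enough to 5.0.0 for this purpose. The function names are made up
for illustration and say nothing about what utf8proc does internally.

```python
import unicodedata


def nfkc_fold(s: str) -> str:
    """One pass of NFKC(Fold(s)); str.casefold() stands in for Fold()."""
    return unicodedata.normalize("NFKC", s.casefold())


def fc_nfkc_closure(a: str):
    """Return c when NFKC(Fold(b)) != b, following the derivation quoted
    from DerivedNormalizationProps.txt; return None otherwise."""
    b = nfkc_fold(a)
    c = nfkc_fold(b)
    return c if c != b else None


def nfkc_fold_fixpoint(s: str) -> str:
    """Repeat the pass until the string stops changing (hypothetical helper)."""
    prev = None
    while s != prev:
        prev, s = s, nfkc_fold(s)
    return s


if __name__ == "__main__":
    # U+037A GREEK YPOGEGRAMMENI is one of the listed characters: NFKC
    # turns it into SPACE + U+0345, and U+0345 only case-folds afterwards.
    for ch in ("\u037a", "A"):
        once = nfkc_fold(ch)
        print(
            f"U+{ord(ch):04X}",
            [f"{ord(x):04X}" for x in once],
            [f"{ord(x):04X}" for x in nfkc_fold_fixpoint(ch)],
            fc_nfkc_closure(ch) is not None,
        )
```

Whether utf8proc's recursive decomposition already covers these cases
would have to be checked against the characters in that list.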

-- 
lucidimagination.com
