lucy-dev mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-dev] Unicode integration
Date Thu, 17 Nov 2011 14:29:41 GMT
On 17/11/2011 14:08, Robert Muir wrote:
> yeah, the problematic ones can be seen here:
> http://www.unicode.org/Public/5.0.0/ucd/DerivedNormalizationProps.txt
>
> # Derived Property: FC_NFKC_Closure
> #  Generated from computing: b = NFKC(Fold(a)); c = NFKC(Fold(b));
> #  Then if (c != b) add the mapping from a to c to the set of
> #  mappings that constitute the FC_NFKC_Closure list
>
> So from what I can tell at a glance: with the utf8proc algorithm, if
> you specify NFKC and casefolding, its not yet 'done'

I just verified that the output utf8proc produces with the options
STABLE, COMPOSE, COMPAT, and CASEFOLD does match the FC_NFKC_Closure
mapping. See the test program at https://gist.github.com/1373256

This is because case folding is done together with the decomposition step.
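
To make the difference concrete, here is a minimal Python sketch. It is not utf8proc and not the test program from the gist; the function names are made up for illustration, and Python's Unicode tables are newer than the 5.0.0 data quoted above. It contrasts a single "fold, then NFKC" pass with a pass that folds while it decomposes, which is roughly the behaviour described above.

```python
import unicodedata


def nfkc_fold_once(s):
    """One pass of NFKC(Fold(s)): fold the whole string, then normalize."""
    return unicodedata.normalize("NFKC", s.casefold())


def fold_and_decompose(ch):
    """Recursively case-fold and compatibility-decompose one code point.

    Every code point produced by a fold or a decomposition is fed back
    through the same function, so folding also reaches characters that
    only appear after decomposition.
    """
    folded = ch.casefold()
    if folded != ch:
        return "".join(fold_and_decompose(c) for c in folded)
    mapping = unicodedata.decomposition(ch)
    if mapping:
        fields = mapping.split()
        # Compatibility mappings start with a tag such as "<compat>".
        if fields[0].startswith("<"):
            fields = fields[1:]
        return "".join(fold_and_decompose(chr(int(f, 16))) for f in fields)
    return ch


def fold_decompose_compose(s):
    """Fold during decomposition, then reorder and recompose via NFC."""
    expanded = "".join(fold_and_decompose(ch) for ch in s)
    return unicodedata.normalize("NFC", expanded)
```

For U+037A (GREEK YPOGEGRAMMENI), nfkc_fold_once returns U+0020 U+0345, and only a second call turns that into U+0020 U+03B9, so the character belongs to the closure list. fold_decompose_compose reaches U+0020 U+03B9 directly, because U+0345 is folded as soon as the decomposition produces it.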

I also think this would be a nice default for a search engine.

Nick
