lucy-dev mailing list archives

From Robert Muir
Subject Re: [lucy-dev] Unicode integration
Date Tue, 22 Nov 2011 12:20:57 GMT
On Thu, Nov 17, 2011 at 10:52 PM, Marvin Humphrey wrote:
> OK, I remain at least academically interested in what sort of performance
> advantages 'simple' case folding affords us, and at what penalty in terms of
> relevancy.

I think it depends on how it's implemented; I'm not sure there is really a
performance advantage to the simpler one. In ICU at least, the
recursive part of nfkc_cf is computed up front, into the data files,
and you get normalization + case folding at runtime in one pass (versus
utf8proc's multiple passes, and it's not clear all the corner cases are
working there).
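
To make the "multiple passes" point concrete, here is a rough sketch of
folding and normalizing as separate steps that are repeated until the
string stops changing. It uses Python's standard unicodedata module purely
as a stand-in; fold_multipass and max_passes are made-up names, and this is
not how utf8proc or ICU actually implement it. It only approximates
NFKC_Casefold (for one thing, it does not remove default-ignorable code
points).

```python
import unicodedata


def fold_multipass(text: str, max_passes: int = 8) -> str:
    """Approximate NFKC + full case folding with repeated passes.

    Hypothetical helper for illustration only. Each pass folds case,
    then applies NFKC; the loop stops once a pass changes nothing.
    """
    current = text
    for _ in range(max_passes):
        folded = unicodedata.normalize("NFKC", current.casefold())
        if folded == current:
            break  # fixed point reached, nothing left to fold
        current = folded
    return current


if __name__ == "__main__":
    # U+2122 TRADE MARK SIGN has no case mapping of its own, but NFKC
    # expands it to "TM", which then needs another folding pass.
    for sample in ["Straße", "\u2122", "\uff21"]:
        print(repr(sample), "->", repr(fold_multipass(sample)))
```

A precomputed table, as described for ICU above, would bake the result of
that loop into the data so each code point is mapped once at runtime.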

As far as relevance goes, I think realistically only German users (ß/SS) or
anyone with Ancient Greek would care if you cheated and used the
simple one instead, especially if you are already normalizing anyway.
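
A small sketch of the ß/SS difference, assuming Python as a stand-in.
str.casefold() performs full case folding. Python has no built-in simple
case folding, so simple_fold below is a hypothetical approximation that
only accepts one-to-one mappings; it is not taken from Lucy, utf8proc, or
ICU.

```python
def simple_fold(text: str) -> str:
    """Approximate simple case folding (hypothetical helper).

    Keeps only mappings that stay a single code point, so length never
    changes. Falls back to lower() and then to the original character
    when the full folding would expand to several code points.
    """
    out = []
    for ch in text:
        full = ch.casefold()
        if len(full) == 1:
            out.append(full)
        else:
            lowered = ch.lower()
            out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)


def full_fold(text: str) -> str:
    """Full case folding as provided by the standard library."""
    return text.casefold()


if __name__ == "__main__":
    a, b = "Straße", "STRASSE"
    print("simple:", simple_fold(a) == simple_fold(b))
    print("full:  ", full_fold(a) == full_fold(b))
```

With the simple variant, "Straße" and "STRASSE" fold to different strings,
so a query for one would miss the other; with full folding they match.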

But that was just my point: if you are normalizing anyway, why not
just choose a normalization form that also does the case folding?

