lucy-dev mailing list archives

From: Nick Wellnhofer <wellnho...@aevum.de>
Subject: Re: [lucy-dev] Unicode integration
Date: Thu, 17 Nov 2011 13:05:44 GMT

On 17/11/2011 06:09, Marvin Humphrey wrote:
> I wonder: does either "common" or "simple" Unicode case folding preserve a
> one-to-one relationship between num-code-points-in and num-code-points-out?

Yes, simple case folding does.

> Because I believe that a case folding algorithm with that property would not
> mess up the Highlighting data.
>
> But then it looks like utf8proc only offers one CASEFOLD option.  I wonder
> which one it is, or if it's configurable.

It only offers full case folding afaics.
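
To illustrate why full case folding is a problem for highlighting offsets, here is a minimal sketch in Python rather than utf8proc. It relies on Python's str.casefold(), which applies full case folding; the helper name folded_lengths is made up for this example.

```python
# Full case folding can change the number of code points, so any
# offsets computed after folding no longer line up with the source text.

def folded_lengths(text):
    """Return (code points in, code points out) under full case folding."""
    return len(text), len(text.casefold())

samples = [
    "Stra\u00dfe",  # contains U+00DF, which full folding expands to "ss"
    "\ufb01sh",     # starts with the U+FB01 ligature, which expands to "fi"
    "HELLO",        # plain ASCII, folds one-to-one
]

for sample in samples:
    print(sample, folded_lengths(sample))
```

The first two samples come out one code point longer than they went in. Simple case folding would leave both counts equal, since it maps each code point to exactly one code point.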

Simple case folding would work before tokenization, but I still don't 
like the idea of allowing certain analyzers before tokenization only on 
the condition that they don't add or remove code points. There might 
even be some long-term gains if we move tokenization completely out of 
the analysis chain. The analyzers could then work directly on tokens 
instead of inversions, and we could employ a token cache, for example.
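
A rough sketch of what tokenize-first analysis with a token cache might look like, in Python rather than Lucy's C. Everything here is hypothetical: the (text, start, end) token layout, the toy analyze_token chain, and the names are assumptions for illustration, not Lucy's API.

```python
from functools import lru_cache

def analyze_token(text):
    # Hypothetical per-token chain: fold case, then strip a trailing "s".
    folded = text.casefold()
    if len(folded) > 1 and folded.endswith("s"):
        return folded[:-1]
    return folded

@lru_cache(maxsize=4096)
def cached_analyze(text):
    # Token cache: a repeated token text skips the analysis chain.
    return analyze_token(text)

def analyze(tokens):
    # Offsets were fixed by the tokenizer against the original text, so
    # they stay valid even if the chain adds or removes code points.
    return [(cached_analyze(text), start, end) for text, start, end in tokens]
```

Because the offsets are recorded before any analyzer runs, the analyzers are free to change the token text in any way without affecting highlighting.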

Nick
