From Nick Wellnhofer
Subject Re: [lucy-dev] Unicode integration
Date Wed, 16 Nov 2011 22:24:22 GMT
On 16/11/11 04:49, Marvin Humphrey wrote:
> It would be great to support accent stripping in Lucy -- that's something a
> lot of people need.  Normalization would also be a nice feature to offer
> (Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
> replacement?).

Thinking about the implications of Unicode in the analyzer chain, I've 
come to the conclusion that the first step should always be 
tokenization. In the current implementation the CaseFolder comes first 
in the chain by default. But case folding (or lowercasing) can add or 
remove Unicode codepoints and mess with the character offsets for the 
highlighter. See the attached script for a demonstration.
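
A minimal sketch of the offset problem, written in Python purely for
illustration and independent of the attached script. It assumes a
hypothetical whitespace tokenizer (token_spans) and a hypothetical
highlight helper that slices the original text with the recorded
offsets. Folding before tokenizing records offsets into the folded
string, which no longer line up with the original text.

```python
import re

def token_spans(text):
    """Hypothetical tokenizer: (token, start, end) per non-whitespace run."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def highlight(text, spans):
    """Slice the ORIGINAL text with the offsets stored for each token."""
    return [text[start:end] for _, start, end in spans]

# U+0130 (capital I with dot above) folds to two code points and
# U+00DF (sharp s) folds to "ss", so the folded string is longer.
original = "\u0130stanbul Stra\u00dfe"
folded = original.casefold()

# Order 1: fold first, then tokenize. Offsets refer to the folded string.
fold_first = token_spans(folded)

# Order 2: tokenize first, then fold each token. Offsets refer to the
# original string and stay valid for highlighting.
token_first = [(tok.casefold(), s, e) for tok, s, e in token_spans(original)]

print(len(original), len(folded))
print(highlight(original, fold_first))
print(highlight(original, token_first))
```

With the fold-first order, the second token's offsets select the wrong
slice of the original text, while the tokenize-first order returns the
original words intact.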

> It would also be great to migrate Lucy::Analysis::CaseFolder code away from
> its dependency on the Perl C API.

Yes, we could even do proper Unicode case folding, normalization and 
accent stripping in one pass with utf8proc. This should be the next step 
after tokenization. The stopalizer and stemmers should be safe when 
using NFC or NFKC. I think we can leave the choice between these 
normalization forms to the user.

If we go with utf8proc, I would propose a new analyzer 
Lucy::Analysis::Normalizer with the following interface:

my $normalizer = Lucy::Analysis::Normalizer->new(
     normalization_form => $string,
     case_fold          => $bool,
     strip_accents      => $bool,
);

normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The 
decomposed forms won't play well with other analyzers but could be 
easily added for completeness. I'm not sure whether we should default to 

case_fold and strip_accents are simple on/off switches. By default 
case_fold is enabled and strip_accents disabled.
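
To make the three options concrete, here is a rough Python sketch of
what such a normalizer might do to a single token. It uses Python's
standard unicodedata module rather than utf8proc, and the function name
normalize_token is an assumption for illustration, not part of Lucy.
The normalization form is a required argument here because no default
has been settled on. Accent stripping is approximated by decomposing
and dropping all characters in the Unicode mark categories, which may
differ in detail from what utf8proc does.

```python
import unicodedata

FORMS = ("NFC", "NFKC", "NFD", "NFKD")

def normalize_token(text, normalization_form, case_fold=True, strip_accents=False):
    """Hypothetical stand-in for the proposed Normalizer, applied to one token."""
    if normalization_form not in FORMS:
        raise ValueError("unknown normalization form: %r" % (normalization_form,))
    if strip_accents:
        # Assumption: "accents" means combining marks (categories Mn, Mc, Me).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(
            ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
        )
    if case_fold:
        # Full Unicode case folding; may change the number of code points.
        text = text.casefold()
    # Normalize last so the output is in the requested form.
    return unicodedata.normalize(normalization_form, text)

print(normalize_token("Caf\u00e9", "NFC"))
print(normalize_token("Caf\u00e9", "NFC", strip_accents=True))
print(normalize_token("\ufb01", "NFKC", case_fold=False))
```

Since this runs per token after tokenization, the token's start and end
offsets into the original text are never touched.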

The default analyzer chain would be tokenize, normalize, stem.

Lucy::Analysis::CaseFolder could then be implemented as a subclass of 
Lucy::Analysis::Normalizer for compatibility.

Further idea: implement a simple and fast tokenizer in core based on the 
Unicode character class table provided with utf8proc.
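
As a rough illustration of the idea, the following Python sketch
tokenizes on Unicode general categories. It looks categories up through
Python's unicodedata module instead of the utf8proc table, and the
choice of which categories count as word characters (letters, marks and
numbers) is an assumption, not a settled design.

```python
import unicodedata

def is_word_char(ch):
    """Assumption: letters (L*), marks (M*) and numbers (N*) form tokens."""
    return unicodedata.category(ch)[0] in "LMN"

def tokenize(text):
    """Return (token, start, end) tuples with code point offsets."""
    tokens = []
    start = None
    for i, ch in enumerate(text):
        if is_word_char(ch):
            if start is None:
                start = i          # a new token begins here
        elif start is not None:
            tokens.append((text[start:i], start, i))
            start = None
    if start is not None:          # flush a token that runs to the end
        tokens.append((text[start:], start, len(text)))
    return tokens

print(tokenize("Hello, w\u00f6rld 42!"))
```

Treating marks as word characters keeps decomposed sequences such as
"e" followed by U+0301 inside a single token. A real tokenizer would
likely need extra rules, for example for apostrophes.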


