lucy-dev mailing list archives

From Robert Muir <>
Subject Re: [lucy-dev] Unicode integration
Date Thu, 17 Nov 2011 00:46:50 GMT
On Wed, Nov 16, 2011 at 5:24 PM, Nick Wellnhofer <> wrote:
> On 16/11/11 04:49, Marvin Humphrey wrote:
>> It would be great to support accent stripping in Lucy -- that's something a
>> lot of people need.  Normalization would also be a nice feature to offer
>> (Maybe we should make it the first step of PolyAnalyzer or PolyAnalyzer's
>> replacement?).
> Thinking about the implications of Unicode in the analyzer chain, I've come
> to the conclusion that the first step should always be tokenization. In the
> current implementation the CaseFolder comes first in the chain by default.
> But case folding (or lowercasing) can add or remove Unicode codepoints and
> mess with the character offsets for the highlighter. See the attached script
> for a demonstration.
>> It would also be great to migrate Lucy::Analysis::CaseFolder code away from
>> its dependency on the Perl C API.
> Yes, we could even do proper Unicode case folding, normalization and accent
> stripping in one pass with utf8proc. This should be the next step after
> tokenization. The stopalizer and stemmers should be safe when using NFC or
> NFKC. I think we can leave the choice between these normalization forms to
> the user.
> If we go with utf8proc, I would propose a new analyzer
> Lucy::Analysis::Normalizer with the following interface:
> my $normalizer = Lucy::Analysis::Normalizer->new(
>    normalization_form => $string,
>    case_fold          => $bool,
>    strip_accents      => $bool,
> );
> normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The
> decomposed forms won't play well with other analyzers but could be easily
> added for completeness. I'm not sure whether we should default to NFC or
> NFKC.
> case_fold and strip_accents are simple on/off switches. By default case_fold
> is enabled and strip_accents disabled.
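
To illustrate the offset problem Nick describes in the quoted text above, here is a
minimal sketch in Python. It is not Lucy code: Python's str.casefold() stands in
for the CaseFolder, a simple \w+ regex stands in for the tokenizer, and the
helper name tokens_with_offsets is made up for this example.

```python
import re

def tokens_with_offsets(text):
    """Return (token, start, end) triples; offsets are counted in codepoints."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

original = "Stra\u00dfe frei"  # "Straße frei"; U+00DF folds to the two letters "ss"

# Fold first, then tokenize: offsets refer to the folded string,
# which is one codepoint longer than the original.
folded_first = tokens_with_offsets(original.casefold())

# Tokenize first, then fold each token: offsets still refer to the original.
tokenized_first = [(tok.casefold(), start, end)
                   for tok, start, end in tokens_with_offsets(original)]

print(folded_first)
print(tokenized_first)
```

With folding first, the second token's offsets are shifted by one relative to
the original text, so a highlighter slicing the original string would mark the
wrong span. Tokenizing first keeps the offsets aligned with the original.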

Does your Unicode library also support "NFKC_Casefold"? It might be a
nice default:

# Derived Property:   NFKC_Casefold (NFKC_CF)
#   This property removes certain variations from characters: case,
#   compatibility, and default-ignorables.
#   It is used for loose matching and certain types of identifiers.
#   It is constructed by applying NFKC, CaseFolding, and removal of
#   Default_Ignorable_Code_Points.
#   The process of applying these transformations is repeated until a
#   stable result is produced.
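
The construction described in that comment can be sketched roughly as follows.
This is only an approximation in Python, not utf8proc or Lucy code: the function
name approx_nfkc_casefold is hypothetical, str.casefold() stands in for
CaseFolding, and SAMPLE_IGNORABLES is a small hand-picked subset of
Default_Ignorable_Code_Point rather than the full property.

```python
import unicodedata

# Hand-picked subset of Default_Ignorable_Code_Point, for illustration only;
# the real property covers many more code points.
SAMPLE_IGNORABLES = {"\u00ad", "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def approx_nfkc_casefold(text):
    """Apply NFKC, case folding, and ignorable removal until the result is stable."""
    while True:
        step = unicodedata.normalize("NFKC", text).casefold()
        step = "".join(ch for ch in step if ch not in SAMPLE_IGNORABLES)
        step = unicodedata.normalize("NFKC", step)
        if step == text:
            return text
        text = step

# U+FB01 (fi ligature), a soft hyphen, and upper-case letters all collapse.
print(approx_nfkc_casefold("\ufb01\u00adLE"))
# U+2167 (Roman numeral eight) is a compatibility character.
print(approx_nfkc_casefold("\u2167"))
```

A library that exposes the NFKC_Casefold property directly would replace the
loop with a single table lookup per code point; the loop here only mirrors the
"repeated until a stable result is produced" wording.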

