lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Unicode integration
Date Thu, 17 Nov 2011 05:09:58 GMT
On Wed, Nov 16, 2011 at 11:24:22PM +0100, Nick Wellnhofer wrote:
> Thinking about the implications of Unicode in the analyzer chain, I've  
> come to the conclusion that the first step should always be  
> tokenization.

I'm not sure whether I concur, just yet.  Case folding after tokenization
likely means a minor performance hit, as it will be slightly more expensive to
casefold the text as individual tokens rather than in bulk.  I'm sure that the
degradation would be acceptable for the sake of correctness, but it would be
nice to explore all possibilities before we decide that it's required.
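
For the record, a rough way to eyeball that cost outside of Lucy is to run
Perl's fc() (full case folding, 5.16+) once over a bulk string and again over
the same text pre-split into tokens.  It says nothing about utf8proc, but it
frames the question:

    use strict;
    use warnings;
    use utf8;
    use feature 'fc';              # full Unicode case folding, Perl 5.16+
    use Benchmark qw( cmpthese );

    my $bulk   = join ' ', ('Grüße und MASSE') x 10_000;
    my @tokens = split ' ', $bulk;

    cmpthese( -2, {
        bulk      => sub { my $folded = fc $bulk },
        per_token => sub { my @folded = map { fc $_ } @tokens },
    } );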

> In the current implementation the CaseFolder comes first in the chain by
> default.  But case folding (or lowercasing) can add or remove Unicode
> codepoints and mess with the character offsets for the highlighter.

Lucy::Analysis::CaseFolder is a rebadged KinoSearch::Analysis::LCNormalizer.
LCNormalizer simply applied lc() to the text to achieve case-insensitive
search.  When LCNormalizer was renamed to "CaseFolder", it gained the intent
that it would apply "Unicode case folding" -- though not the reality.

Here is some background for anyone following along who may be unfamiliar with
the distinction between Unicode case folding and other case manipulation
techniques:

    http://unicode.org/faq/casemap_charprop.html#2

    Q: What is the difference between case mapping and case folding?

    A: Case mapping or case conversion is a process whereby strings are
    converted to a particular form—uppercase, lowercase, or titlecase—possibly
    for display to the user. Case folding is primarily used for caseless
    comparison of text, such as identifiers in a computer program, rather than
    actual text transformation. Case folding in Unicode is based on the
    lowercase mapping, but includes additional changes to the source text to
    help make it language-insensitive and consistent. As a result, case-folded
    text should be used solely for internal processing and generally should
    not be stored or displayed to the end user.

The fact that CaseFolder is currently powered by the function that underlies
Perl's lc() means that it is buggy and incomplete.  It also means that Lucy
never had to choose between the various flavors of Unicode case folding.
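
To make that concrete, here is a minimal sketch of where lc() and real case
folding part ways -- Perl's fc() (5.16 and up) does full Unicode case folding,
and the German strings are just for illustration:

    use utf8;
    use feature 'fc';    # full Unicode case folding, Perl 5.16+
    binmode STDOUT, ':encoding(UTF-8)';

    # Case mapping is for display...
    print lc("Maße"), "\n";    # "maße"  -- lc() leaves ß alone
    print uc("Maße"), "\n";    # "MASSE" -- uc() maps ß to SS

    # ...case folding is for caseless comparison.
    print fc("Maße") eq fc("MASSE") ? "match\n" : "no match\n";    # match
    print lc("Maße") eq lc("MASSE") ? "match\n" : "no match\n";    # no match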

I wonder: does either "common" or "simple" Unicode case folding preserve a
one-to-one relationship between num-code-points-in and num-code-points-out?
Because I believe that a case folding algorithm with that property would not
mess up the Highlighting data.
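
Full case folding, at least as Perl's fc() implements it, definitely lacks
that property -- folding ß expands one code point into two, which is exactly
the kind of shift that would throw off precomputed highlighter offsets:

    use utf8;
    use feature 'fc';    # Perl 5.16+

    my $source = "Straße";
    my $folded = fc($source);    # "strasse"

    # length() counts code points on a decoded string: 6 in, 7 out.
    printf "%d code points in, %d out\n", length($source), length($folded);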

But then it looks like utf8proc only offers one CASEFOLD option.  I wonder
which one it is, or if it's configurable.

> See the attached script for a demonstration.

Ah, Turkish İ.  Sigh.

Thank you for going to the trouble to provide that excellent code sample.

> If we go with utf8proc, I would propose a new analyzer  
> Lucy::Analysis::Normalizer with the following interface:
>
> my $normalizer = Lucy::Analysis::Normalizer->new(
>     normalization_form => $string,
>     case_fold          => $bool,
>     strip_accents      => $bool,
> );

It seems that utf8proc offers composite string transformations.  I agree with
the basic concept of a Lucy Analyzer class which is a wrapper around those
composite capabilities.  

> normalization_form can be one of 'NFC', 'NFKC', 'NFD', 'NFKD'. The  
> decomposed forms won't play well with other analyzers but could be  
> easily added for completeness.
>
> I'm not sure whether we should default to  
> NFC or NFKC.
>
> case_fold and strip_accents are simple on/off switches. By default  
> case_fold is enabled and strip_accents disabled.
>
> The default analyzer chain would be tokenize, normalize, stem.

The gist of your proposal seems sound.  It's great to see that you are
thinking about all these things, and to see them all laid out here.

I don't see much to disagree with in your API choices, aside from the questions
of what the default analyzer order should be and whether case_fold should be a
boolean.  Neither of those quibbles blocks the proposal.
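
For the archives, here is a minimal sketch of what that default chain might
look like from the Perl side, assuming the constructor you proposed plus our
existing RegexTokenizer and SnowballStemmer -- the class names and defaults
are placeholders, not a decision:

    use Lucy;

    my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [
            Lucy::Analysis::RegexTokenizer->new,    # tokenize
            Lucy::Analysis::Normalizer->new(        # normalize (proposed class)
                normalization_form => 'NFKC',
                case_fold          => 1,
                strip_accents      => 0,
            ),
            Lucy::Analysis::SnowballStemmer->new( language => 'en' ),   # stem
        ],
    );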

> Lucy::Analysis::CaseFolder could then be implemented as a subclass of  
> Lucy::Analysis::Normalizer for compatibility.

Makes sense.
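
Conceptually, something like the following -- though the real subclass would
live in the core, so treat this pure-Perl shim as a sketch only, with the
exact constructor args still up in the air:

    package Lucy::Analysis::CaseFolder;
    use parent -norequire, 'Lucy::Analysis::Normalizer';

    sub new {
        my ( $class, %args ) = @_;
        # Preserve the old name while delegating the work to Normalizer.
        # Whether it should also normalize, and to which form, is open.
        return $class->SUPER::new(
            %args,
            case_fold     => 1,
            strip_accents => 0,
        );
    }

    1;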

> Further idea: implement a simple and fast tokenizer in core based on the  
> Unicode character class table provided with utf8proc.

Sounds interesting.  Presumably it would use a fixed pattern...

Marvin Humphrey


