incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Unicode integration
Date Fri, 18 Nov 2011 03:52:12 GMT
On Thu, Nov 17, 2011 at 02:05:44PM +0100, Nick Wellnhofer wrote:
> On 17/11/2011 06:09, Marvin Humphrey wrote:
>> I wonder: does either "common" or "simple" Unicode case folding preserve a
>> one-to-one relationship between num-code-points-in and num-code-points-out?
>
> Yes, simple case folding does.

OK, I remain at least academically interested in what sort of performance
advantages 'simple' case folding affords us, and at what penalty in terms of
relevancy.
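
To make the one-to-one property concrete, here is a minimal sketch in C. It is
not Lucy or utf8proc code: FoldEntry, FOLD_TABLE, fold_simple() and
fold_full() are hypothetical names, and the table holds only three rows
modeled on Unicode's CaseFolding.txt plus an arithmetic rule for ASCII. The
point is only that simple folding writes exactly one code point per input
code point, while full folding may write more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hand-picked rows modeled on Unicode's CaseFolding.txt. Illustrative only:
 * the real table is far larger. All names here are hypothetical. */
typedef struct {
    uint32_t cp;        /* input code point */
    uint32_t simple;    /* simple folding: always exactly one code point */
    uint32_t full[3];   /* full folding: one to three code points */
    size_t   full_len;  /* number of code points used in full[] */
} FoldEntry;

static const FoldEntry FOLD_TABLE[] = {
    { 0x00DF, 0x00DF, { 0x0073, 0x0073 }, 2 },  /* sharp s: full -> "ss" */
    { 0x0130, 0x0130, { 0x0069, 0x0307 }, 2 },  /* I with dot above */
    { 0xFB01, 0xFB01, { 0x0066, 0x0069 }, 2 },  /* fi ligature: full -> "fi" */
};

/* Linear search is fine for a three-row demo table. */
static const FoldEntry *find_entry(uint32_t cp) {
    size_t i;
    for (i = 0; i < sizeof FOLD_TABLE / sizeof FOLD_TABLE[0]; i++) {
        if (FOLD_TABLE[i].cp == cp) { return &FOLD_TABLE[i]; }
    }
    return NULL;
}

/* Fold ASCII A-Z to a-z; leave everything else alone. */
static uint32_t fold_ascii(uint32_t cp) {
    return (cp >= 0x41 && cp <= 0x5A) ? cp + 0x20 : cp;
}

/* Simple folding: out needs room for len code points; returns len. */
size_t fold_simple(const uint32_t *in, size_t len, uint32_t *out) {
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        const FoldEntry *e = find_entry(in[i]);
        out[i] = e ? e->simple : fold_ascii(in[i]);
    }
    return len;
}

/* Full folding: out needs room for 3 * len code points; returns the number
 * of code points written, which can exceed len. */
size_t fold_full(const uint32_t *in, size_t len, uint32_t *out) {
    size_t i, j, n = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        const FoldEntry *e = find_entry(in[i]);
        if (e) {
            for (j = 0; j < e->full_len; j++) { out[n++] = e->full[j]; }
        } else {
            out[n++] = fold_ascii(in[i]);
        }
    }
    return n;
}
```

Because fold_simple() never changes the code point count, character offsets
computed before folding stay valid afterwards, which is what would make
folding ahead of tokenization workable. With fold_full(), any offsets would
have to be remapped.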

However, as you and Robert are in agreement that NFKC_CF should work well, and
since utf8proc apparently only supports full casefolding anyhow, I'd love to
see an implementation.

I've been working today on pulling utf8proc into our repository:

    https://issues.apache.org/jira/browse/LUCY-189
    https://issues.apache.org/jira/browse/LEGAL-110

With those patches applied, utf8proc is integrated into our build, and you can
#include "utf8proc.h" from any C file under either core/ or perl/xs/.

> It only offers full case folding afaics.

Okeedoke -- thanks for looking into the matter.

> Simple case folding would work before tokenization but I still don't
> like the idea of allowing certain analyzers before tokenization if they
> don't add or remove codepoints. There might even be some long term gains
> if we move tokenization completely out of the analysis chain.
> The analyzers could work directly on tokens instead of inversions and we
> could employ a token cache, for example.

Since we redacted the Analyzer subclassing API, we have a lot of freedom to
make such experiments!

Marvin Humphrey

