lucy-dev mailing list archives

From Nick Wellnhofer <>
Subject [lucy-dev] Unicode integration
Date Tue, 15 Nov 2011 20:51:49 GMT

(Moving the "Custom analyzers" thread from lucy-user to lucy-dev)

On 15/11/11 05:22, Marvin Humphrey wrote:
> On Tue, Nov 15, 2011 at 02:41:22AM +0100, Nick Wellnhofer wrote:
>> Would it make sense to have all the Unicode functionality in the Lucy
>> core using a third party Unicode library? Or should we rely on the
>> Unicode support of the host language like we do for case folding?
> That hinges on the dependability, portability, licensing terms and
> ease-of-integration for this theoretical third party Unicode library.
> Dependencies are cool so long as we can bundle them, they don't take a million
> years to compile, they don't sabotage all the hard work we've done to make
> Lucy portable, etc.  (For a longer take on dependencies, see
> <>.)

If all dependencies must be bundled, we can rule out something like ICU 
[1] because it's simply too big.

One alternative I could find is utf8proc [2]. It's 20K of C code, 
MIT-licensed and used for Postgres extensions and a Ruby gem. It 
supports Unicode normalization, case folding and stripping of accents.
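
To make those three operations concrete, here is a minimal sketch using Python's standard library rather than utf8proc itself (so none of this is utf8proc's API). The helper names normalize, fold_case and strip_accents are made up for illustration, and accent stripping is approximated as "decompose, then drop combining marks", which may differ in detail from what utf8proc does.

```python
import unicodedata


def normalize(text: str, form: str = "NFC") -> str:
    """Return text in the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


def fold_case(text: str) -> str:
    """Case-fold text for caseless matching (stronger than lower())."""
    return text.casefold()


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (an approximation)."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks such as U+0301
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(normalize("e\u0301") == "\u00e9")
    print(fold_case("Straße"))
    print(strip_accents("Café naïve"))
```

A core implementation would expose the same three steps to every host language instead of relying on each host's own tables.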

Then there's the Perl module Unicode::Normalize with very similar 
functionality. But I'm not sure if the Perl License is compatible with 
the Apache License.

One downside of bundling a Unicode library is that all of them need some 
rather large tables. utf8proc comes with a 1.2 MB .c file containing the 
tables. The whole library compiles to about 500 KB. Unicode::Normalize 
builds its tables from the Unicode database files that come with Perl 
and compiles to about 300 KB. These figures are for i386 (32-bit).

On the positive side, we'd have things like case folding, normalization 
and accent stripping directly in core. We'd also get Unicode features 
for new host languages out of the box, and it's the only way to make sure 
Unicode is handled consistently across different host languages and 
client platforms. The consistency point might be a rather academic 
concern, though.
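
As a rough illustration of the consistency concern, the sketch below simulates two hypothetical host bindings that fold case differently: one with lower(), one with full case folding. The function names and the set-based "index" are assumptions made for the example and are not Lucy code.

```python
def host_a_fold(term: str) -> str:
    """Hypothetical host A: simple lowercasing."""
    return term.lower()


def host_b_fold(term: str) -> str:
    """Hypothetical host B: full Unicode case folding."""
    return term.casefold()


def build_index(terms, fold):
    """Store the folded form of every term in a set (a stand-in for an index)."""
    return {fold(term) for term in terms}


def lookup(index, query: str, fold) -> bool:
    """Fold the query the same way a searcher would, then test membership."""
    return fold(query) in index


if __name__ == "__main__":
    # Indexed by host A, searched by host B: the folded forms disagree.
    mixed = build_index(["Straße"], host_a_fold)
    print(lookup(mixed, "STRASSE", host_b_fold))

    # Indexed and searched with the same folding: the term is found.
    same = build_index(["Straße"], host_b_fold)
    print(lookup(same, "STRASSE", host_b_fold))
```

With the folding done in core, both sides would share a single implementation and this kind of mismatch could not arise between hosts.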


