lucy-dev mailing list archives

From Peter Karman
Subject Re: [lucy-dev] Unicode integration
Date Fri, 18 Nov 2011 02:06:24 GMT
Marvin Humphrey wrote on 11/16/11 11:09 PM:
> On Wed, Nov 16, 2011 at 11:24:22PM +0100, Nick Wellnhofer wrote:


>> The default analyzer chain would be tokenize, normalize, stem.
> The gist of your proposal seems sound.  It's great to see that you are
> thinking about all these things, and to see them all laid out here.
> I don't see much to disagree with in your API choices, aside from the questions
> of what the default analyzer order should be and whether case_fold should be a
> boolean.  Neither of those quibbles blocks the proposal.

+1 to that.

I've enjoyed following this thread, having wrestled with UTF-8 analysis a lot in
libswish3[0]. I think robust UTF-8 string handling in core is a win, especially
if it includes a relatively lightweight, portable way of dealing with the
Unicode tables.

+1 to utf8proc

Thanks for initiating this thread, Nick.


Peter Karman
