lucy-dev mailing list archives

From Peter Karman <pe...@peknet.com>
Subject Re: [lucy-dev] Unicode integration
Date Fri, 18 Nov 2011 02:06:24 GMT
Marvin Humphrey wrote on 11/16/11 11:09 PM:
> On Wed, Nov 16, 2011 at 11:24:22PM +0100, Nick Wellnhofer wrote:

[snip]

>>
>> The default analyzer chain would be tokenize, normalize, stem.
> 
> The gist of your proposal seems sound.  It's great to see that you are
> thinking about all these things, and to see them all laid out here.
> 
> I don't see much to disagree with in your API choices, aside from the questions
> of what the default analyzer order should be and whether case_fold should be a
> boolean.  Neither of those quibbles block the proposal.
> 

+1 to that.

I've enjoyed following this thread, having wrestled with UTF-8 analysis a lot in
libswish3[0]. I think robust UTF-8 string handling in core is a win, especially
if it includes a relatively lightweight, portable way of dealing with the
Unicode tables.

+1 to utf8proc

Thanks for initiating this thread, Nick.

[0] http://s.apache.org/722

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
