incubator-lucy-dev mailing list archives

From:    Nick Wellnhofer <wellnho...@aevum.de>
Subject: Re: [lucy-dev] StandardTokenizer has landed
Date:    Tue, 06 Dec 2011 13:45:46 GMT
On 06/12/2011 05:16, Marvin Humphrey wrote:
> I didn't grok everything that was being done in the compressed table lookup
> scheme, but your code is as well-documented and easy to follow as anything
> that does that much bit-twiddling possibly could be, and I feel like I could
> dive in and work on it if the need arose.

This and similar schemes are widely used in Unicode processing. It isn't 
too complicated once you wrap your head around it. There's also a brief 
description in section 5.1 of the Unicode Standard.
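
For the curious, here is a toy sketch in C of the general idea. The
real tables are generated from the Unicode Character Database, cover
all planes and are packed more tightly, and the property values below
are made up, so treat this as an illustration of the lookup path
rather than of the actual implementation:

    #include <stdint.h>
    #include <stdio.h>

    /* Stage 1 maps the high bits of a code point to a block number;
     * stage 2 stores one 16-entry row of properties per *unique*
     * block.  Blocks with identical contents collapse onto the same
     * row, which is where the compression comes from. */

    #define WB_SHIFT 4                       /* 16 code points per block */
    #define WB_MASK  ((1 << WB_SHIFT) - 1)

    enum { WB_OTHER = 0, WB_ALETTER = 1, WB_NUMERIC = 2 };

    /* Stage 1: block number for each 16-code-point chunk of ASCII. */
    static const uint8_t wb_index[8] = { 0, 0, 0, 1, 2, 3, 2, 3 };

    /* Stage 2: one row per unique block. */
    static const uint8_t wb_blocks[4][16] = {
        { 0 },                                /* all "Other"             */
        { 2,2,2,2,2,2,2,2,2,2,0,0,0,0,0,0 },  /* 0x30-0x39: digits       */
        { 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 },  /* 0x41-0x4F and 0x61-0x6F */
        { 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0 },  /* 0x50-0x5A and 0x70-0x7A */
    };

    static int
    wb_property(uint32_t cp) {
        if (cp > 0x7F) { return WB_OTHER; }   /* toy table is ASCII-only */
        return wb_blocks[wb_index[cp >> WB_SHIFT]][cp & WB_MASK];
    }

    int
    main(void) {
        printf("'A' -> %d, '7' -> %d, ' ' -> %d\n",
               wb_property('A'), wb_property('7'), wb_property(' '));
        return 0;
    }

Note how the chunks 0x40-0x4F and 0x60-0x6F share one row, and
0x50-0x5F and 0x70-0x7F share another. With the full Unicode range the
savings are much larger, since most blocks carry no word break
property at all.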

I also made the assumption that the Tokenizer input is valid UTF-8. 
Is that guaranteed?
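
Concretely, "assume valid UTF-8" means the scanner can decode
multi-byte sequences without any validation -- roughly like this (a
minimal sketch, not the actual Lucy code):

    #include <stdint.h>

    /* Decode one code point and advance *ptr.  No checks for overlong
     * forms, truncated sequences or stray continuation bytes; on
     * malformed input this happily reads past the intended sequence,
     * which is why the "input is valid UTF-8" contract matters. */
    static uint32_t
    decode_utf8(const uint8_t **ptr) {
        const uint8_t *s = *ptr;
        uint32_t cp;
        if (s[0] < 0x80) {            /* 0xxxxxxx                   */
            cp = s[0];
            *ptr = s + 1;
        }
        else if (s[0] < 0xE0) {       /* 110xxxxx 10xxxxxx          */
            cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            *ptr = s + 2;
        }
        else if (s[0] < 0xF0) {       /* 1110xxxx 10xxxxxx 10xxxxxx */
            cp = ((uint32_t)(s[0] & 0x0F) << 12)
                 | ((uint32_t)(s[1] & 0x3F) << 6)
                 |  (s[2] & 0x3F);
            *ptr = s + 3;
        }
        else {                        /* 11110xxx 10xxxxxx ...      */
            cp = ((uint32_t)(s[0] & 0x07) << 18)
                 | ((uint32_t)(s[1] & 0x3F) << 12)
                 | ((uint32_t)(s[2] & 0x3F) << 6)
                 |  (s[3] & 0x3F);
            *ptr = s + 4;
        }
        return cp;
    }

If the host language already guarantees well-formed UTF-8 at the
boundary, skipping those checks in the tokenizer is safe; otherwise a
decoder like the one above would need the full validation dance.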

What I still want to do is to incorporate the word break test cases from 
the Unicode website:

http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt

I like the way the snowball stemmer tests read their test data from 
JSON files using our own parser, so I'd convert the Unicode tests to 
JSON with a Perl script. I saw that there's an issue with RAT and 
JSON files because we can't include a license header in them. Maybe 
we should put all material derived from the Unicode database 
(including the word break tables) in a single directory such as 
modules/unicode/ucd, the way it's done for the snowball stemmer.
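
For reference, the data lines in WordBreakTest.txt mark positions
where a break is allowed with "÷" and positions where one is
forbidden with "×", along the lines of

    ÷ 0041 × 0308 ÷ 0020 ÷ 0031 ÷    # comment describing each position

A converted JSON record could then look something like this (a
made-up shape just to sketch the idea; the script would settle the
exact layout):

    { "text": "A\u0308 1", "breaks": [0, 2, 3, 4] }

where "breaks" lists the code point offsets at which the algorithm
must allow a break.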

Nick
