lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: [lucy-dev] Implementing a tokenizer in core
Date Wed, 30 Nov 2011 16:04:29 GMT
On Wed, Nov 30, 2011 at 04:40:00PM +0100, Nick Wellnhofer wrote:
> I had a closer look at the word boundary rules in UAX #29, and they  
> shouldn't be too hard to implement without using an external library. I  
> started with an initial prototype and it looks very promising.


> In order to look up the Word_Break property values, we have to precompute
> a few tables. I would write a Perl script for that. The tables can be  
> generated once and shipped with the source code much like the tables for  
> utf8proc. I'm not sure where to put that script and the generated  
> tables, though.

The script likely belongs in trunk/devel/bin.

The file with the generated tables could arguably go in a few different
places.  I would suggest either trunk/core/Lucy/Analysis/WordBreakTables.c
if the tables are specialized, or trunk/core/Lucy/Util/UnicodeProperties.c if
we anticipate adding more tables in the future.
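
For illustration only, here is a minimal sketch of what such a generated table
file and its lookup routine might look like.  It is not taken from Nick's
prototype.  It uses a sorted range table with binary search, which is simpler
than the staged tables utf8proc ships but shows the same idea of precomputed
property data.  The names WordBreak, WBRange, wb_ranges and wb_lookup are
hypothetical, and the table is an incomplete ASCII-only excerpt: any code
point not listed falls back to Other, which is not correct for every
character.

```c
#include <stddef.h>
#include <stdint.h>

/* Subset of the UAX #29 Word_Break property values (hypothetical enum). */
typedef enum {
    WB_OTHER = 0,
    WB_CR,
    WB_LF,
    WB_NEWLINE,
    WB_ALETTER,
    WB_NUMERIC,
    WB_EXTENDNUMLET
} WordBreak;

/* One inclusive range of code points sharing a property value. */
typedef struct {
    uint32_t first;
    uint32_t last;
    uint8_t  prop;
} WBRange;

/* Incomplete, hand-written excerpt for illustration.  A real table would be
 * emitted by the generator script and must stay sorted by code point. */
static const WBRange wb_ranges[] = {
    { 0x000A, 0x000A, WB_LF },
    { 0x000B, 0x000C, WB_NEWLINE },
    { 0x000D, 0x000D, WB_CR },
    { 0x0030, 0x0039, WB_NUMERIC },       /* 0-9 */
    { 0x0041, 0x005A, WB_ALETTER },       /* A-Z */
    { 0x005F, 0x005F, WB_EXTENDNUMLET },  /* underscore */
    { 0x0061, 0x007A, WB_ALETTER }        /* a-z */
};

/* Binary search over the sorted ranges; unlisted code points are Other. */
static WordBreak
wb_lookup(uint32_t cp) {
    size_t lo = 0;
    size_t hi = sizeof(wb_ranges) / sizeof(wb_ranges[0]);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < wb_ranges[mid].first) {
            hi = mid;
        }
        else if (cp > wb_ranges[mid].last) {
            lo = mid + 1;
        }
        else {
            return (WordBreak)wb_ranges[mid].prop;
        }
    }
    return WB_OTHER;
}
```

A tokenizer would call wb_lookup() on each decoded code point and apply the
UAX #29 boundary rules to adjacent property values.  Whether a range table or
a staged table is the better fit depends on table size and lookup speed, which
the sketch does not measure.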

The generated file will need to embed the Unicode license, and should not have
an ALv2 license header.  We will also need to add an entry in
trunk/devel/conf/rat-excludes so that the generated file doesn't get flagged
by the Apache RAT[1] license header check run by buildbot at

Lastly, we will need to adapt LICENSE and NOTICE to accommodate the new
data.  I'll start a new thread for that, as recent conversations on
general@incubator indicated the need for additional changes to our LICENSE and
NOTICE files.

Marvin Humphrey

[2] Whoops, this is failing right now.  We need to deal with this before
