lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: [lucy-dev] StandardTokenizer has landed
Date Wed, 07 Dec 2011 02:56:26 GMT
On Tue, Dec 06, 2011 at 10:43:02PM +0100, Nick Wellnhofer wrote:
>> I'm pretty sure I can tease that lookup code apart if I try, but it would be
>> easier with narration -- and it will certainly be easier for people who
>> haven't done much hard-core UTF-8 and Unicode coding.
> I will try to make that code more descriptive.

Your improvements are great!

I now understand what this commit did:

    Optimize word break property lookup

    UTF-8 decoding and property lookup can be done in one pass. If we use
    index tables with a shift of 6 bits, the two passes align very nicely.
    We also optimize for ASCII characters, so this should be really fast.

I can now see where the UTF-8 decoding logic ends and the table lookup begins.
It looks like your table lookup code is indeed branchless, and I can't suggest
any way to optimize it further.

I also wanted to double check what happens when invalid UTF-8 shows up.  It
looks like the masking that's in place would force any bogus header bytes
positioned as continuation bytes to be evaluated safely, so no problem there.

The one thing that isn't clear to me is that it's impossible to overshoot the
end of the compressed lookup table arrays.  I see that we're covered as far as
the plane_index table goes:

        if (plane_index >= WB_PLANE_MAP_SIZE) { return 0; }
        plane_id  = wb_plane_map[plane_index];

There aren't boundary checks for the other tables, but I see that you defined
a bunch of size-related constants in the autogenerated file
which haven't yet been used:

  #define WB_PLANES_SHIFT 6
  #define WB_PLANES_MASK  63
  #define WB_PLANES_SIZE  1472

Perhaps you were already planning to add stuff like this eventually?

  #if (WB_ASCII_SIZE < 128)
    #error "ASCII word break table too small"

> It isn't really made clear but the whole process can again be applied to  
> the first-stage table so you end up with three tables. If you use a  
> 16-bit and an 8-bit split, you can think of Unicode planes and rows.

Perfect, got it!

This comment was especially helpful:

            // four byte sequence
            // 11110ppp 10pppppp 10rrrrrr 10cccccc

> The word break tables are tested for all Unicode characters in  
> but that doesn't cover the C code.
> I just implemented and committed the tests against the UCD test suite.  
> There were two bugs in the implementation of the actual word break rules  
> that I fixed. Now the complete test suite passes.

Passes for me, too, and runs clean under Valgrind.  Lookin' good!

Marvin Humphrey

View raw message