lucy-dev mailing list archives

From Nick Wellnhofer
Subject Re: [lucy-dev] StandardTokenizer has landed
Date Wed, 07 Dec 2011 14:44:13 GMT
On 07/12/2011 03:56, Marvin Humphrey wrote:
> I also wanted to double check what happens when invalid UTF-8 shows up.  It
> looks like the masking that's in place would force any bogus header bytes
> positioned as continuation bytes to be evaluated safely, so no problem there.
> The one thing that isn't clear to me is that it's impossible to overshoot the
> end of the compressed lookup table arrays.  I see that we're covered as far as
> the plane_index table goes:
>          if (plane_index >= WB_PLANE_MAP_SIZE) { return 0; }
>          plane_id  = wb_plane_map[plane_index];
> There aren't boundary checks for the other tables,

The other tables don't need boundary checks because each of them is 
indexed using an ID read from another table, and those IDs are all safe 
to use.
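
To illustrate the chained lookup described above, here is a minimal 
sketch with tiny hypothetical tables. The demo_* names, table sizes and 
table contents are made up for the example and are not the generated 
ones. Only the first index comes straight from the input, so only it 
gets a runtime check. The later indices are either IDs stored in a table 
or values masked to 6 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature tables; the real generated tables are larger. */
#define DEMO_PLANE_MAP_SIZE 2
#define DEMO_NUM_PLANES     2
#define DEMO_NUM_ROWS       2

/* plane_index -> plane_id; every entry is < DEMO_NUM_PLANES. */
static const uint8_t demo_plane_map[DEMO_PLANE_MAP_SIZE] = { 0, 1 };

/* plane_id x 6-bit row index -> row_id; every entry is < DEMO_NUM_ROWS. */
static const uint8_t demo_planes[DEMO_NUM_PLANES][64] = {
    { 0 },
    { [5] = 1 }
};

/* row_id x 6-bit column index -> property value. */
static const uint8_t demo_rows[DEMO_NUM_ROWS][64] = {
    { 0 },
    { [3] = 7 }
};

static int
demo_lookup(size_t plane_index, unsigned row_bits, unsigned col_bits) {
    /* plane_index is derived from input bytes, so it must be checked. */
    if (plane_index >= DEMO_PLANE_MAP_SIZE) { return 0; }

    /* plane_id comes from a table, so it is in range by construction.
     * The assert only documents that invariant for debug builds. */
    size_t plane_id = demo_plane_map[plane_index];
    assert(plane_id < DEMO_NUM_PLANES);

    /* Masking to 6 bits keeps the second index below 64, whatever
     * byte the input contains. */
    size_t row_id = demo_planes[plane_id][row_bits & 63];
    assert(row_id < DEMO_NUM_ROWS);

    return demo_rows[row_id][col_bits & 63];
}
```

In this sketch, demo_lookup(1, 5, 3) returns 7, and an out-of-range 
plane_index such as 99 returns 0 without touching any table.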

I can see only two bad things that can happen with invalid UTF-8:

1. The tokenizer doesn't detect invalid UTF-8, so it will pass it to 
other analyzers, possibly creating even more invalid UTF-8.

2. If there's invalid UTF-8 near the end of the input buffer, we might 
read up to three bytes past the end of the buffer.
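
As a sketch of the second point, the following hypothetical helpers 
(not part of the tokenizer) show where the three bytes come from: a lead 
byte can claim a sequence of up to four bytes, so a decoder that trusts 
it may read three bytes beyond the lead. One possible guard, assuming 
the caller knows the end of the buffer, compares the claimed length with 
the bytes that remain. The length is judged by the lead byte only, so 
this does not validate the sequence otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes a UTF-8 sequence claims to occupy, judged by its lead
 * byte only. Returns 0 for bytes that cannot start a sequence. */
static size_t
demo_utf8_claimed_len(uint8_t lead) {
    if (lead < 0x80) { return 1; }
    if (lead < 0xC0) { return 0; }  /* stray continuation byte */
    if (lead < 0xE0) { return 2; }
    if (lead < 0xF0) { return 3; }
    if (lead < 0xF8) { return 4; }
    return 0;                       /* 0xF8..0xFF never appear in UTF-8 */
}

/* Returns 1 if the sequence starting at ptr lies wholly inside
 * [ptr, end), 0 otherwise. */
static int
demo_utf8_fits(const char *ptr, const char *end) {
    if (ptr >= end) { return 0; }
    size_t len = demo_utf8_claimed_len((uint8_t)*ptr);
    return len != 0 && len <= (size_t)(end - ptr);
}
```

With a buffer whose last byte is 0xF0, demo_utf8_fits() returns 0 
because the lead byte claims four bytes and only one remains.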

> but I see that you defined
> a bunch of size-related constants in the autogenerated file
> which haven't yet been used:
>    #define WB_PLANES_SHIFT 6
>    #define WB_PLANES_MASK  63
>    #define WB_PLANES_SIZE  1472
> Perhaps you were already planning to add stuff like this eventually?
>    #if (WB_ASCII_SIZE < 128)
>      #error "ASCII word break table too small"
>    #endif

The tables used to have different shift and mask values, therefore the 
SHIFT and MASK defines. Now they're fixed at 6 bits and the defines are 

