Mailing-List: contact lucy-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: lucy-dev@incubator.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Message-ID: <4EDE8C66.1040807@aevum.de>
Date: Tue, 06 Dec 2011 22:43:02 +0100
From: Nick Wellnhofer <wellnhofer@aevum.de>
User-Agent: Mozilla/5.0 (Windows NT 6.1;
 rv:8.0) Gecko/20111105 Thunderbird/8.0
MIME-Version: 1.0
To: lucy-dev@incubator.apache.org
References: <20111205210243.03A7023889B8@eris.apache.org>
 <20111205213814.GB12562@rectangular.com>
 <20111206041648.GA13526@rectangular.com> <4EDE1C8A.60501@aevum.de>
 <20111206190830.GA28067@rectangular.com>
In-Reply-To: <20111206190830.GA28067@rectangular.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: [lucy-dev] StandardTokenizer has landed

On 06/12/11 20:08, Marvin Humphrey wrote:
> On Tue, Dec 06, 2011 at 02:45:46PM +0100, Nick Wellnhofer wrote:
>> This and similar schemes are widely used in Unicode processing. It isn't
>> too complicated once you wrap your head around it.
>
> Yes, that's the impression I got.  The data structures didn't seem hard --
> just a couple extra derefs, essentially.  It's all the bit-twiddling that
> makes it hard to grok -- but that's the price we pay for the efficiency of
> bitwise operations.
>
> Can I ask you to please add some comments to S_wb_lookup() in
> StandardTokenizer.c?  Also, longer and more descriptive variable names would
> be helpful -- it's not immediately clear what "i1", "i2", and "i3" represent.
>
> I'm pretty sure I can tease that lookup code apart if I try, but it would be
> easier with narration -- and it will certainly be easier for people who
> haven't done much hard-core UTF-8 and Unicode coding.

I will try to make that code more descriptive.

>> There's also a brief description in section 5.1 of the Unicode Standard.
>
> Looks like this:
>
>      http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
>
>      Optimized Two-Stage Table.
>
>      Wherever any blocks are identical, the pointers just point to the same
>      block. For transcoding tables, this case occurs generally for a block
>      containing only mappings to the default or “unmappable” character. Instead
>      of using NULL pointers and a default value, one “shared” block of default
>      entries is created. This block is pointed to by all first-stage table
>      entries, for which no character value can be mapped.  By avoiding tests
>      and branches, this strategy provides access time that approaches the
>      simple array access, but at a great savings in storage.

It's actually a three-stage table. See the next paragraph:

	Multistage Table Tuning.

	Given a table of arbitrary size and content, it is a relatively
	simple matter to write a small utility that can calculate the
	optimal number of stages and their width for a multistage
	table. Tuning the number of stages and the width of their
	arrays of index pointers can result in various trade-offs of
	table size versus average access time.

It isn't really made clear but the whole process can again be applied to 
the first-stage table so you end up with three tables. If you use a 
16-bit and an 8-bit split, you can think of Unicode planes and rows.

>> What I still want to do is to incorporate the word break test cases from
>> the Unicode website:
>>
>> http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt
>
> That would really help.  Bitwise code is hard to debug visually.  That's one
> reason why Lucy::Object::BitVector and Lucy::Util::NumberUtils have unit tests
> which are somewhat more thorough than other Lucy components.

The word break tables are tested for all Unicode characters in 
gen_word_break_tables.pl but that doesn't cover the C code.

I just implemented and committed the tests against the UCD test suite. 
There were two bugs in the implementation of the actual word break rules 
that I fixed. Now the complete test suite passes.

Nick