Return-Path: X-Original-To: apmail-incubator-lucy-dev-archive@www.apache.org Delivered-To: apmail-incubator-lucy-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1931A939C for ; Tue, 6 Dec 2011 21:43:35 +0000 (UTC) Received: (qmail 56187 invoked by uid 500); 6 Dec 2011 21:43:35 -0000 Delivered-To: apmail-incubator-lucy-dev-archive@incubator.apache.org Received: (qmail 56160 invoked by uid 500); 6 Dec 2011 21:43:35 -0000 Mailing-List: contact lucy-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: lucy-dev@incubator.apache.org Delivered-To: mailing list lucy-dev@incubator.apache.org Received: (qmail 56152 invoked by uid 99); 6 Dec 2011 21:43:35 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 Dec 2011 21:43:34 +0000 X-ASF-Spam-Status: No, hits=0.7 required=5.0 tests=RCVD_IN_DNSWL_NONE,SPF_HELO_PASS,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [212.227.17.10] (HELO moutng.kundenserver.de) (212.227.17.10) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 Dec 2011 21:43:27 +0000 Received: from [192.168.178.20] (mnch-5d8756a0.pool.mediaWays.net [93.135.86.160]) by mrelayeu.kundenserver.de (node=mreu3) with ESMTP (Nemesis) id 0Lyfid-1QkFoq07mx-015sRB; Tue, 06 Dec 2011 22:43:07 +0100 Message-ID: <4EDE8C66.1040807@aevum.de> Date: Tue, 06 Dec 2011 22:43:02 +0100 From: Nick Wellnhofer User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20111105 Thunderbird/8.0 MIME-Version: 1.0 To: lucy-dev@incubator.apache.org References: <20111205210243.03A7023889B8@eris.apache.org> <20111205213814.GB12562@rectangular.com> <20111206041648.GA13526@rectangular.com> <4EDE1C8A.60501@aevum.de> <20111206190830.GA28067@rectangular.com> In-Reply-To: <20111206190830.GA28067@rectangular.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Provags-ID: V02:K0:OJYcmFuAF+oKoxobUi5eDSRyqNraVfEFDzPR8LEa1YL H/dUZugBHonSAaXt61YV39yrGWVnw1eceAtX25MuTrrq7YE42J SfADVmWKDLhUmMo7UlZC25plrxrNvVFJ7eJEuQeemAUycUvAgi IqYF0Ib9+RIRvbtHlR6bHLZ+pi3oEBQCDe/hhG/7gkbXqluXHu vVxTyxA3V5tWfUVI/DZfgHE5HEuH3Ka+sq9NJpvNkG5ZZ1kr8a keFAx9csdO2C3FUK7zyuO3q11+o0gDvItRw+S6/79FTee56fT1 nx6Y3LRIF5fG8WwgOGH841vNhK9PjcTNdSf5ffUMTadaYbLqGi BruZV1BCn8vKkUHXiko0= X-Virus-Checked: Checked by ClamAV on apache.org Subject: Re: [lucy-dev] StandardTokenizer has landed On 06/12/11 20:08, Marvin Humphrey wrote: > On Tue, Dec 06, 2011 at 02:45:46PM +0100, Nick Wellnhofer wrote: >> This and similar schemes are widely used in Unicode processing. It isn't >> too complicated once you wrap your head around it. > > Yes, that's the impression I got. The data structures didn't seem hard -- > just a couple extra derefs, essentially. It's all the bit-twiddling that > makes it hard to grok -- but that's the price we pay for the efficiency of > bitwise operations. > > Can I ask you to please add some comments to S_wb_lookup() in > StandardTokenizer.c? Also, longer and more descriptive variable names would > be helpful -- it's not immediately clear what "i1", "i2", and "i3" represent. > > I'm pretty sure I can tease that lookup code apart if I try, but it would be > easier with narration -- and it will certainly be easier for people who > haven't done much hard-core UTF-8 and Unicode coding. I will try to make that code more descriptive. >> There's also a brief description in section 5.1 of the Unicode Standard. > > Looks like this: > > http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf > > Optimized Two-Stage Table. > > Wherever any blocks are identical, the pointers just point to the same > block. For transcoding tables, this case occurs generally for a block > containing only mappings to the default or “unmappable” character. Instead > of using NULL pointers and a default value, one “shared” block of default > entries is created. This block is pointed to by all first-stage table > entries, for which no character value can be mapped. By avoiding tests > and branches, this strategy provides access time that approaches the > simple array access, but at a great savings in storage. It's actually a three-stage table. See the next paragraph: Multistage Table Tuning. Given a table of arbitrary size and content, it is a relatively simple matter to write a small utility that can calculate the optimal number of stages and their width for a multistage table. Tuning the number of stages and the width of their arrays of index pointers can result in various trade-offs of table size versus average access time. It isn't really made clear but the whole process can again be applied to the first-stage table so you end up with three tables. If you use a 16-bit and an 8-bit split, you can think of Unicode planes and rows. >> What I still want to do is to incorporate the word break test cases from >> the Unicode website: >> >> http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt > > That would really help. Bitwise code is hard to debug visually. That's one > reason why Lucy::Object::BitVector and Lucy::Util::NumberUtils have unit tests > which are somewhat more thorough than other Lucy components. The word break tables are tested for all Unicode characters in gen_word_break_tables.pl but that doesn't cover the C code. I just implemented and committed the tests against the UCD test suite. There were two bugs in the implementation of the actual word break rules that I fixed. Now the complete test suite passes. Nick