incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] StandardTokenizer has landed
Date Tue, 06 Dec 2011 19:08:30 GMT
On Tue, Dec 06, 2011 at 02:45:46PM +0100, Nick Wellnhofer wrote:
> On 06/12/2011 05:16, Marvin Humphrey wrote:
>> I didn't grok everything that was being done in the compressed table lookup
>> scheme, but your code is as well-documented and easy to follow as anything
>> that does that much bit-twiddling possibly could be, and I feel like I could
>> dive in and work on it if the need arose.
>
> This and similar schemes are widely used in Unicode processing. It isn't  
> too complicated once you wrap your head around it.

Yes, that's the impression I got.  The data structures didn't seem hard --
just a couple extra derefs, essentially.  It's all the bit-twiddling that
makes it hard to grok -- but that's the price we pay for the efficiency of
bitwise operations.

Can I ask you to please add some comments to S_wb_lookup() in
StandardTokenizer.c?  Also, longer and more descriptive variable names would
be helpful -- it's not immediately clear what "i1", "i2", and "i3" represent.

I'm pretty sure I can tease that lookup code apart if I try, but it would be
easier with narration -- and it will certainly be easier for people who
haven't done much hard-core UTF-8 and Unicode coding.

> There's also a brief description in section 5.1 of the Unicode Standard.

Looks like this:

    http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf

    Optimized Two-Stage Table.
    
    Wherever any blocks are identical, the pointers just point to the same
    block. For transcoding tables, this case occurs generally for a block
    containing only mappings to the default or “unmappable” character. Instead
    of using NULL pointers and a default value, one “shared” block of default
    entries is created. This block is pointed to by all first-stage table
    entries, for which no character value can be mapped.  By avoiding tests
    and branches, this strategy provides access time that approaches the
    simple array access, but at a great savings in storage.

> I also made the assumption that the Tokenizer input is valid UTF-8. Is  
> that true?

That is enforced in the default indexing chain by the fact that
Lucy_ViewCB_Assign_Str() performs a UTF-8 validity check in this section of
trunk/perl/xs/Lucy/Index/Inverter.c:

        // Get the field value, forcing text fields to UTF-8.
        switch (Lucy_FType_Primitive_ID(type) & lucy_FType_PRIMITIVE_ID_MASK) {
            case lucy_FType_TEXT: {
                    STRLEN val_len;
                    char *val_ptr = SvPVutf8(value_sv, val_len);
                    lucy_ViewCharBuf *value
                        = (lucy_ViewCharBuf*)inv_entry->value;
                    Lucy_ViewCB_Assign_Str(value, val_ptr, val_len);
                    break;
                }

Token_Set_Text(), Token_new() and others do not enforce UTF-8 validity.  So
the answer to your question is that source material which comes in through
Indexer should be clean unless the chain contains a broken Analyzer, but lower
level APIs do not currently make guarantees.

We could probably stand to be more deliberate about where we check for UTF-8
validity in the indexing/analysis chain.  It's important for efficiency
reasons that modules such as StandardTokenizer be able to assume UTF-8 sanity
-- however I think it's important that feeding such a module invalid UTF-8
result in a exception (uncatchable if need be) rather than a potentially
exploitable memory error.

> What I still want to do is to incorporate the word break test cases from  
> the Unicode website:
>
> http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.txt

That would really help.  Bitwise code is hard to debug visually.  That's one
reason why Lucy::Object::BitVector and Lucy::Util::NumberUtils have unit tests
which are somewhat more thorough than other Lucy components.

> I like the way the snowball stemmer tests read test data from JSON files  
> using our own parser. So I'd convert the Unicode tests to JSON with a  
> perl script.

+1 

> I saw that there is an issue with JSON files and RAT  because we can't
> include a license header.

I think we should rid ourselves of this annoyance by modifying our JSON parser
to handle comments, probably line comments starting with "#".

If others agree, I'll open an issue.  Handling pound comments won't be that
hard to implement, and it would benefit the project if someone else were to
gain a little familiarity the JSON parser.

> Maybe we should put all  Unicode database related material (also the word
> break tables) in a  single directory like modules/unicode/ucd like the
> snowball stemmer.

Works for me.

FWIW, "modules" was originally intended as a place to put components which
would be optional.  Theoretically, we only need Analyzer itself in core -- all
of the subclasses can live outside, to be compiled as separate shared objects
or not.  But there's other work that needs to happen before we can compile
anything other than a monolithic Lucy shared object.

Marvin Humphrey


Mime
View raw message