incubator-lucy-dev mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-dev] utf8proc, control chars and non-character code points
Date Wed, 14 Dec 2011 14:31:53 GMT
On 14/12/2011 01:28, Marvin Humphrey wrote:
> I just committed a test to trunk which verifies that utf8proc's normalization
> works properly, in that normalizing a second time is a no-op.  However, I had
> to disable the test because utf8proc chokes when fed strings which contain
> either control characters or non-character code points.

You're right that utf8proc doesn't allow non-characters, but I don't
think that control characters are blocked.

> contain noncharacters.  Noncharacters are not supposed to be used for
> interchange, but Lucy is a library, not an application, and thus should pass
> noncharacters cleanly.

By that argument we could also remove the check for Unicode surrogates. 
OTOH, passing UTF-8 to a library is a kind of interchange.
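
For reference, a minimal sketch of what a surrogate check looks like. The function name is_surrogate is hypothetical and is not taken from Lucy or utf8proc. Surrogates (U+D800..U+DFFF) are not Unicode scalar values, and well-formed UTF-8 never encodes them, whereas noncharacters are valid scalar values that are merely not intended for open interchange.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of Lucy or utf8proc.
 * Return true if code_point lies in the UTF-16 surrogate range
 * U+D800..U+DFFF. */
static bool
is_surrogate(uint32_t code_point) {
    return code_point >= 0xD800 && code_point <= 0xDFFF;
}
```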

> Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
> reports an error, we simply leave the token alone.  That seems appropriate in
> the case of malformed UTF-8, but I question whether it is appropriate for
> valid UTF-8 sequences containing control characters or non-character code
> points.

We should either remove the check for non-characters from utf8proc or 
disallow non-characters in the rest of Lucy. I'm fine with either solution.

> +        if ((code_point & 0xFFFF) == 0xFFEF

This should check for 0xFFFE, not 0xFFEF. The noncharacters at the end of
each plane are U+nFFFE and U+nFFFF.
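
A complete noncharacter test might look roughly like the sketch below. The function name is_noncharacter is hypothetical and this is not the committed patch. It assumes the usual Unicode definition of the 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not the code from the patch.
 * Return true if code_point is one of the 66 Unicode noncharacters. */
static bool
is_noncharacter(uint32_t code_point) {
    /* U+FDD0..U+FDEF: a contiguous block of 32 noncharacters. */
    if (code_point >= 0xFDD0 && code_point <= 0xFDEF) {
        return true;
    }
    /* U+nFFFE and U+nFFFF in planes 0 through 16. Masking with 0xFFFE
     * ignores the lowest bit, so both values match in a single test. */
    if (code_point <= 0x10FFFF && (code_point & 0xFFFE) == 0xFFFE) {
        return true;
    }
    return false;
}
```

With this test, 0xFFEF (the value in the quoted line) is not treated as a noncharacter, while 0xFFFE and 0xFFFF are.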

Nick
