lucy-dev mailing list archives

From Peter Karman <pe...@peknet.com>
Subject Re: [lucy-dev] utf8proc, control chars and non-character code points
Date Wed, 14 Dec 2011 10:18:28 GMT
Marvin Humphrey wrote on 12/13/11 6:28 PM:
> Greets,
> 
> I just committed a test to trunk which verifies that utf8proc's normalization
> works properly, in that normalizing a second time is a no-op.  However, I had
> to disable the test because utf8proc chokes when fed strings which contain
> either control characters or non-character code points.
> 
>     http://svn.apache.org/viewvc?view=revision&revision=1213996
> 
> The test uses random UTF-8 data, generated by TestUtils_random_string().  With
> the hack below my sig, the test passes.
> 
> Strings which contain control characters are valid UTF-8, as are strings which
> contain noncharacters.  Noncharacters are not supposed to be used for
> interchange, but Lucy is a library, not an application, and thus should pass
> noncharacters cleanly.
> 
>     http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
> 
> Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
> reports an error, we simply leave the token alone.  That seems appropriate in
> the case of malformed UTF-8, but I question whether it is appropriate for
> valid UTF-8 sequences containing control characters or non-character code
> points.


Swish3 uses the \003 control character as an internal field delimiter, so
passing that through is pretty vital. Are you saying that utf8proc chokes on
that valid UTF-8 sequence?
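
For reference, here is a minimal sketch of why \003 counts as valid UTF-8:
any code point below 0x80 encodes as a single byte equal to itself. The
function name utf8_encode is hypothetical (it is not a Lucy or utf8proc
API), and this is only an illustration of the encoding rules, not how either
library does it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Returns the number of bytes written to out, or 0 if cp is a UTF-16
 * surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {
        /* ASCII range, including control characters such as U+0003. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0; /* Surrogates are not encodable in UTF-8. */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x03, buf) writes the single byte 0x03, so a Swish3-style
delimiter is a well-formed one-byte sequence. Noncharacters such as U+FFFE
encode just like any other code point in their range.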



> 
> Index: core/Lucy/Test/TestUtils.c
> ===================================================================
> --- core/Lucy/Test/TestUtils.c  (revision 1213967)
> +++ core/Lucy/Test/TestUtils.c  (working copy)
> @@ -17,6 +17,7 @@
>  #define C_LUCY_TESTUTILS
>  #include "Lucy/Util/ToolSet.h"
>  #include <string.h>
> +#include <ctype.h>
>  
>  #include "Lucy/Test/TestUtils.h"
>  #include "Lucy/Test.h"
> @@ -106,6 +107,15 @@
>          if (code_point > 0xD7FF && code_point < 0xE000) {
>              continue; // UTF-16 surrogate.
>          }
> +        if (code_point <= 0xFF && iscntrl(code_point)) {
> +            continue;
> +        }
> +        if ((code_point & 0xFFFF) == 0xFFFE
> +            || (code_point & 0xFFFF) == 0xFFFF
> +            || (code_point >= 0xFDD0 && code_point <= 0xFDEF)
> +           ) {
> +            continue; // Unicode non-character code point.
> +        }
>          break;
>      }
>      return code_point;
> 
> 
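
To make the two categories under discussion concrete, here is a small
standalone sketch of the classification. It assumes the Unicode definitions:
noncharacters are U+FDD0..U+FDEF plus the last two code points of every plane
(U+nFFFE and U+nFFFF), and control characters are the C0 range, DEL, and the
C1 range. The names is_noncharacter and is_control are hypothetical; neither
is taken from Lucy or utf8proc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for the 66 Unicode noncharacters. */
static bool is_noncharacter(uint32_t cp) {
    if (cp > 0x10FFFF) {
        return false; /* Not a code point at all. */
    }
    if (cp >= 0xFDD0 && cp <= 0xFDEF) {
        return true;  /* Contiguous block of 32 noncharacters. */
    }
    /* The low 16 bits are 0xFFFE or 0xFFFF in each of the 17 planes. */
    return (cp & 0xFFFE) == 0xFFFE;
}

/* Hypothetical helper: true for C0 controls, DEL, and C1 controls. */
static bool is_control(uint32_t cp) {
    return cp < 0x20 || (cp >= 0x7F && cp <= 0x9F);
}
```

Unlike iscntrl(), these checks do not depend on the C locale and are defined
for every value of cp, which matters when the input is a full Unicode code
point rather than an unsigned char.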


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
