From Marvin Humphrey
Subject [lucy-dev] utf8proc, control chars and non-character code points
Date Wed, 14 Dec 2011 00:28:28 GMT

I just committed a test to trunk which verifies that utf8proc's normalization
works properly, in that normalizing a second time is a no-op.  However, I had
to disable the test because utf8proc chokes when fed strings which contain
either control characters or non-character code points.

The test uses random UTF-8 data, generated by TestUtils_random_string().  With
the hack below my sig, the test passes.
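
For readers unfamiliar with the property being tested, here is a minimal
sketch of an idempotence check. It is not the committed test: the
normalize_fn signature, the fixed 256-byte buffers, and the ascii_lower
stand-in normalizer are all hypothetical, chosen only so that the example
compiles without utf8proc.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical normalizer signature: writes the normalized form of `src`
 * into `dst` (capacity `cap`) and returns its length, or -1 on error. */
typedef long (*normalize_fn)(const char *src, size_t len,
                             char *dst, size_t cap);

/* Stand-in normalizer (an assumption, NOT utf8proc): lowercases ASCII. */
static long
ascii_lower(const char *src, size_t len, char *dst, size_t cap) {
    if (len > cap) { return -1; }
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        dst[i] = (c >= 'A' && c <= 'Z') ? (char)(c + ('a' - 'A')) : (char)c;
    }
    return (long)len;
}

/* True if normalizing already-normalized output leaves it unchanged. */
static bool
normalization_is_idempotent(normalize_fn normalize,
                            const char *src, size_t len) {
    char once[256];
    char twice[256];
    long once_len = normalize(src, len, once, sizeof(once));
    if (once_len < 0) { return false; }
    long twice_len = normalize(once, (size_t)once_len, twice, sizeof(twice));
    return twice_len == once_len
           && memcmp(once, twice, (size_t)once_len) == 0;
}
```

A real test would pass a wrapper around the normalizer under test in place
of ascii_lower and feed it random strings.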

Strings which contain control characters are valid UTF-8, as are strings which
contain noncharacters.  Noncharacters are not supposed to be used for
interchange, but Lucy is a library, not an application, and thus should pass
noncharacters cleanly.
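
To make the two categories concrete, here is a rough sketch of range-based
checks for them. The function names are hypothetical and this is not code
from Lucy or utf8proc. It uses explicit ranges instead of iscntrl(), since
the C standard only defines iscntrl() for arguments representable as
unsigned char (or EOF), which rules out most code points.

```c
#include <stdbool.h>
#include <stdint.h>

/* Control characters (general category Cc): C0 controls U+0000..U+001F,
 * DEL U+007F, and C1 controls U+0080..U+009F. */
static bool
is_control(uint32_t code_point) {
    return code_point <= 0x1F
           || (code_point >= 0x7F && code_point <= 0x9F);
}

/* The 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of
 * each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF). */
static bool
is_noncharacter(uint32_t code_point) {
    if (code_point > 0x10FFFF) {
        return false; /* Outside the Unicode code space. */
    }
    /* Low 16 bits equal to 0xFFFE or 0xFFFF. */
    return (code_point & 0xFFFE) == 0xFFFE
           || (code_point >= 0xFDD0 && code_point <= 0xFDEF);
}
```

Both kinds of code point encode to well-formed UTF-8, so a validity check
alone will not reject them.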

Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
reports an error, we simply leave the token alone.  That seems appropriate in
the case of malformed UTF-8, but I question whether it is appropriate for
valid UTF-8 sequences containing control characters or non-character code
points.

Marvin Humphrey

Index: core/Lucy/Test/TestUtils.c
--- core/Lucy/Test/TestUtils.c  (revision 1213967)
+++ core/Lucy/Test/TestUtils.c  (working copy)
@@ -17,6 +17,7 @@
 #include "Lucy/Util/ToolSet.h"
 #include <string.h>
+#include <ctype.h>
 #include "Lucy/Test/TestUtils.h"
 #include "Lucy/Test.h"
@@ -106,6 +107,15 @@
         if (code_point > 0xD7FF && code_point < 0xE000) {
             continue; // UTF-16 surrogate.
         }
+        if (iscntrl(code_point)) {
+            continue;
+        }
+        if ((code_point & 0xFFFF) == 0xFFFE
+            || (code_point & 0xFFFF) == 0xFFFF
+            || (code_point >= 0xFDD0 && code_point <= 0xFDEF)
+           ) {
+            continue; // Unicode non-character code point.
+        }
     return code_point;

