couchdb-commits mailing list archives

From j..@apache.org
Subject git commit: Handle invalid UTF-8 byte sequences gracefully by replacing them with 0xFFFD
Date Mon, 04 Mar 2013 20:08:46 GMT
Updated Branches:
  refs/heads/1425-fix-graceful-surrogate-handling be4e41ff2 -> 254d9d583 (forced update)


Handle invalid UTF-8 byte sequences gracefully by replacing them with 0xFFFD

CouchDB's Erlang JSON parser allows invalid UTF-8 byte sequences to be stored.
The Query Server inside CouchDB fails upon encountering these byte sequences,
so the view process fails for the current batch of document updates. The result
is that the view stays broken. Only removing the document in question
fixes it, but finding that document is hard: the `log()` call inside the
Query Server dies with the invalid byte sequence too, because our protocol is
synchronous and map results and the `log()` messages generated alongside them
are submitted together.

This patch replaces invalid bytes with the Unicode replacement character 0xFFFD.
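
The substitution rule can be illustrated with a small standalone sketch. It is
not the code from utf8.c: the function name `decode_utf16_lenient` and the use
of <stdint.h> types instead of SpiderMonkey's `jschar`/`uint32` are assumptions
made for illustration. It also differs from the patch in two ways: it leaves
the unit following an unpaired high surrogate in place so that unit is decoded
on its own, and it substitutes 0xFFFD for a high surrogate at the very end of
the input instead of returning an error.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf8.c): decode one code point from a
 * UTF-16 buffer, substituting 0xFFFD for any unpaired surrogate.
 * Returns the number of 16-bit units consumed (1 or 2). Requires len >= 1. */
static size_t decode_utf16_lenient(const uint16_t *src, size_t len, uint32_t *out)
{
    uint16_t c = src[0];

    /* Anything outside the surrogate range is a code point by itself. */
    if (c <= 0xD7FF || c >= 0xE000) {
        *out = c;
        return 1;
    }

    /* High surrogate: valid only when a low surrogate follows. */
    if (c <= 0xDBFF) {
        if (len >= 2 && src[1] >= 0xDC00 && src[1] <= 0xDFFF) {
            *out = (((uint32_t)c - 0xD800) << 10)
                 + ((uint32_t)src[1] - 0xDC00) + 0x10000;
            return 2;
        }
        *out = 0xFFFD; /* unpaired high surrogate */
        return 1;
    }

    /* Low surrogate with no preceding high surrogate. */
    *out = 0xFFFD;
    return 1;
}
```

A caller would loop over the buffer, advancing by the returned count and
UTF-8 encoding each resulting code point, so a bad unit costs one replacement
character rather than aborting the whole conversion.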

Closes COUCHDB-1425.

Patch by Sam Rijs <recv@awesam.de> and Paul Davis.

Eventually, this should be fixed at the HTTP level, so that no documents
with invalid byte sequences can be written to CouchDB. The jiffy encoder
we'll get with BigCouch will do that for us. This is a fix for the releases
until then.
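
As a rough idea of what rejecting such documents on write would involve, here
is a minimal UTF-8 validity check. This is only a sketch under assumptions: the
function name `is_valid_utf8` is hypothetical, and it says nothing about how
jiffy or CouchDB's HTTP layer actually perform the check. It rejects stray
continuation bytes, truncated sequences, overlong encodings, encoded
surrogates, and values above 0x10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if s[0..len) is well-formed UTF-8, else 0. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t n;        /* number of continuation bytes expected */
        uint32_t cp;     /* code point being assembled */
        uint32_t min;    /* smallest code point allowed for this length */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;   /* stray continuation byte or invalid lead byte */

        if (len - i <= n) return 0;              /* truncated sequence */

        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }

        if (cp < min) return 0;                       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* encoded surrogate */
        if (cp > 0x10FFFF) return 0;                  /* beyond Unicode */

        i += n + 1;
    }
    return 1;
}
```

A write path using a check like this could refuse the request when the
function returns 0, so the Query Server would never see the bad bytes.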


Project: http://git-wip-us.apache.org/repos/asf/couchdb/repo
Commit: http://git-wip-us.apache.org/repos/asf/couchdb/commit/254d9d58
Tree: http://git-wip-us.apache.org/repos/asf/couchdb/tree/254d9d58
Diff: http://git-wip-us.apache.org/repos/asf/couchdb/diff/254d9d58

Branch: refs/heads/1425-fix-graceful-surrogate-handling
Commit: 254d9d5830000181eeb25f7533256ba4d5740b39
Parents: 2b8539d
Author: Jan Lehnardt <jan@apache.org>
Authored: Mon Mar 4 15:09:36 2013 +0100
Committer: Jan Lehnardt <jan@apache.org>
Committed: Mon Mar 4 21:07:59 2013 +0100

----------------------------------------------------------------------
 THANKS.in                        |    1 +
 src/couchdb/priv/couch_js/utf8.c |   29 ++++++++++++++++-------------
 2 files changed, 17 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/couchdb/blob/254d9d58/THANKS.in
----------------------------------------------------------------------
diff --git a/THANKS.in b/THANKS.in
index 4ebf3f0..db0ac07 100644
--- a/THANKS.in
+++ b/THANKS.in
@@ -94,6 +94,7 @@ suggesting improvements or submitting changes. Some of these people are:
  * Fedor Indutny <fedor@indutny.com>
  * Tim Blair
  * Tady Walsh <hello@tady.me>
+ * Sam Rijs <recv@awesam.de>
 # Authors from commit 6c976bd and onwards are auto-inserted. If you are merging
 # a commit from a non-committer, you should not add an entry to this file. When
 # `bootstrap` is run, the actual THANKS file will be generated.

http://git-wip-us.apache.org/repos/asf/couchdb/blob/254d9d58/src/couchdb/priv/couch_js/utf8.c
----------------------------------------------------------------------
diff --git a/src/couchdb/priv/couch_js/utf8.c b/src/couchdb/priv/couch_js/utf8.c
index d606426..2d23cc2 100644
--- a/src/couchdb/priv/couch_js/utf8.c
+++ b/src/couchdb/priv/couch_js/utf8.c
@@ -66,24 +66,31 @@ enc_charbuf(const jschar* src, size_t srclen, char* dst, size_t* dstlenp)
         c = *src++;
         srclen--;
 
-        if((c >= 0xDC00) && (c <= 0xDFFF)) goto bad_surrogate;
-        
-        if(c < 0xD800 || c > 0xDBFF)
+        if(c <= 0xD7FF || c >= 0xE000)
         {
-            v = c;
+            v = (uint32) c;
         }
-        else
+        else if(c >= 0xD800 && c <= 0xDBFF)
         {
             if(srclen < 1) goto buffer_too_small;
             c2 = *src++;
             srclen--;
-            if ((c2 < 0xDC00) || (c2 > 0xDFFF))
+            if(c2 >= 0xDC00 && c2 <= 0xDFFF)
+            {
+                v = (uint32) (((c - 0xD800) << 10) + (c2 - 0xDC00) + 0x10000);
+            }
+            else
             {
-                c = c2;
-                goto bad_surrogate;
+                // Invalid second half of surrogate pair
+                v = (uint32) 0xFFFD;
             }
-            v = ((c - 0xD800) << 10) + (c2 - 0xDC00) + 0x10000;
         }
+        else
+        {
+            // Invalid first half surrogate pair
+            v = (uint32) 0xFFFD;
+        }
+
         if(v < 0x0080)
         {
             /* no encoding necessary - performance hack */
@@ -109,10 +116,6 @@ enc_charbuf(const jschar* src, size_t srclen, char* dst, size_t* dstlenp)
     *dstlenp = (origDstlen - dstlen);
     return JS_TRUE;
 
-bad_surrogate:
-    *dstlenp = (origDstlen - dstlen);
-    return JS_FALSE;
-
 buffer_too_small:
     *dstlenp = (origDstlen - dstlen);
     return JS_FALSE;

