From: jan@apache.org
To: commits@couchdb.apache.org
Reply-To: dev@couchdb.apache.org
Date: Thu, 03 Oct 2013 15:51:44 -0000
Message-Id: <4a14bc4ad96e4e45ad9eacd7e636563a@git.apache.org>
Mailing-List: contact commits-help@couchdb.apache.org; run by ezmlm
X-Mailer: ASF-Git Admin Mailer
Subject: [1/2] git commit: updated refs/heads/master to 073f9e2

Updated Branches:
  refs/heads/master 532100c10 -> 073f9e252


Handle invalid UTF-8 byte sequences gracefully by replacing them with 0xFFFD

CouchDB's Erlang JSON parser allows storing of invalid UTF-8 byte sequences.
The Query Server inside CouchDB fails upon encountering these byte sequences.
The view process fails for the current batch of document updates. The result
is that the view is invariably broken.
Only removing the document in question fixes this, but finding that document
is hard, as the `log()` inside the Query Server dies on the invalid byte
sequence: our protocol is synchronous, and map results and the `log()`
messages generated therein are submitted together.

This patch replaces invalid bytes with the replacement character U+FFFD
(0xFFFD).

Closes COUCHDB-1425.

Patch by Sam Rijs and Paul Davis.

Eventually, this should be fixed at the HTTP level, so that no documents
with invalid byte sequences can be written to CouchDB. The jiffy encoder
we'll get with BigCouch will do that for us. This is a fix for the releases
until then.


Project: http://git-wip-us.apache.org/repos/asf/couchdb/repo
Commit: http://git-wip-us.apache.org/repos/asf/couchdb/commit/9195223b
Tree: http://git-wip-us.apache.org/repos/asf/couchdb/tree/9195223b
Diff: http://git-wip-us.apache.org/repos/asf/couchdb/diff/9195223b

Branch: refs/heads/master
Commit: 9195223b12f6aae993010eea338446d28ab63f54
Parents: 54813a7
Author: Jan Lehnardt
Authored: Mon Mar 4 15:09:36 2013 +0100
Committer: Jan Lehnardt
Committed: Wed Oct 2 15:04:49 2013 +0200

----------------------------------------------------------------------
 THANKS.in                        |  1 +
 src/couchdb/priv/couch_js/utf8.c | 29 ++++++++++++++++-------------
 2 files changed, 17 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/couchdb/blob/9195223b/THANKS.in
----------------------------------------------------------------------
diff --git a/THANKS.in b/THANKS.in
index d82c23d..b87ffec 100644
--- a/THANKS.in
+++ b/THANKS.in
@@ -92,6 +92,7 @@ suggesting improvements or submitting changes. Some of these people are:
  * Fedor Indutny
  * Tim Blair
  * Tady Walsh
+ * Sam Rijs
 # Authors from commit 6c976bd and onwards are auto-inserted. If you are merging
 # a commit from a non-committer, you should not add an entry to this file. When
 # `bootstrap` is run, the actual THANKS file will be generated.

http://git-wip-us.apache.org/repos/asf/couchdb/blob/9195223b/src/couchdb/priv/couch_js/utf8.c
----------------------------------------------------------------------
diff --git a/src/couchdb/priv/couch_js/utf8.c b/src/couchdb/priv/couch_js/utf8.c
index d606426..2d23cc2 100644
--- a/src/couchdb/priv/couch_js/utf8.c
+++ b/src/couchdb/priv/couch_js/utf8.c
@@ -66,24 +66,31 @@ enc_charbuf(const jschar* src, size_t srclen, char* dst, size_t* dstlenp)
         c = *src++;
         srclen--;
 
-        if((c >= 0xDC00) && (c <= 0xDFFF)) goto bad_surrogate;
-
-        if(c < 0xD800 || c > 0xDBFF)
+        if(c <= 0xD7FF || c >= 0xE000)
         {
-            v = c;
+            v = (uint32) c;
         }
-        else
+        else if(c >= 0xD800 && c <= 0xDBFF)
         {
             if(srclen < 1) goto buffer_too_small;
             c2 = *src++;
             srclen--;
-            if ((c2 < 0xDC00) || (c2 > 0xDFFF))
+            if(c2 >= 0xDC00 && c2 <= 0xDFFF)
+            {
+                v = (uint32) (((c - 0xD800) << 10) + (c2 - 0xDC00) + 0x10000);
+            }
+            else
             {
-                c = c2;
-                goto bad_surrogate;
+                // Invalid second half of surrogate pair
+                v = (uint32) 0xFFFD;
             }
-            v = ((c - 0xD800) << 10) + (c2 - 0xDC00) + 0x10000;
         }
+        else
+        {
+            // Invalid first half surrogate pair
+            v = (uint32) 0xFFFD;
+        }
+
         if(v < 0x0080)
         {
             /* no encoding necessary - performance hack */
@@ -109,10 +116,6 @@ enc_charbuf(const jschar* src, size_t srclen, char* dst, size_t* dstlenp)
     *dstlenp = (origDstlen - dstlen);
     return JS_TRUE;
 
-bad_surrogate:
-    *dstlenp = (origDstlen - dstlen);
-    return JS_FALSE;
-
 buffer_too_small:
     *dstlenp = (origDstlen - dstlen);
     return JS_FALSE;
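
For readers who want to try the substitution rule outside the CouchDB tree, here is a minimal standalone sketch of the same idea: decode UTF-16 code units, map any unpaired surrogate to U+FFFD, and emit UTF-8. It is not the code from utf8.c. The names `decode_utf16`, `encode_utf8`, and `utf16_to_utf8` are hypothetical, and the sketch uses `uint16_t` in place of `jschar`. It also makes two choices that differ from the patch above: a high surrogate followed by a non-surrogate unit leaves that unit in place to be decoded on its own, and a high surrogate at the very end of the input is replaced rather than reported as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Decode one code point starting at src[*pos] and advance *pos.
 * Unpaired surrogates decode to U+FFFD (assumption of this sketch:
 * an invalid second unit is not consumed). */
static uint32_t decode_utf16(const uint16_t *src, size_t len, size_t *pos)
{
    assert(src != NULL && pos != NULL && *pos < len);
    uint16_t c = src[(*pos)++];

    if (c <= 0xD7FF || c >= 0xE000) {
        return c; /* ordinary BMP code unit, not a surrogate */
    }
    if (c <= 0xDBFF && *pos < len) { /* high surrogate with a unit after it */
        uint16_t c2 = src[*pos];
        if (c2 >= 0xDC00 && c2 <= 0xDFFF) {
            (*pos)++;
            return (((uint32_t)c - 0xD800) << 10)
                 + ((uint32_t)c2 - 0xDC00) + 0x10000;
        }
    }
    /* lone low surrogate, or high surrogate without a valid partner */
    return REPLACEMENT_CHAR;
}

/* Encode one code point as UTF-8 into dst (at least 4 bytes).
 * Returns the number of bytes written. */
static size_t encode_utf8(uint32_t v, unsigned char *dst)
{
    assert(dst != NULL && v <= 0x10FFFF);
    if (v < 0x80) {
        dst[0] = (unsigned char)v;
        return 1;
    }
    if (v < 0x800) {
        dst[0] = (unsigned char)(0xC0 | (v >> 6));
        dst[1] = (unsigned char)(0x80 | (v & 0x3F));
        return 2;
    }
    if (v < 0x10000) {
        dst[0] = (unsigned char)(0xE0 | (v >> 12));
        dst[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (v & 0x3F));
        return 3;
    }
    dst[0] = (unsigned char)(0xF0 | (v >> 18));
    dst[1] = (unsigned char)(0x80 | ((v >> 12) & 0x3F));
    dst[2] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
    dst[3] = (unsigned char)(0x80 | (v & 0x3F));
    return 4;
}

/* Convert len UTF-16 code units to UTF-8 in dst (capacity cap bytes).
 * Returns the number of bytes written, or (size_t)-1 if dst is too small. */
size_t utf16_to_utf8(const uint16_t *src, size_t len,
                     unsigned char *dst, size_t cap)
{
    size_t pos = 0, out = 0;
    while (pos < len) {
        unsigned char tmp[4];
        size_t n = encode_utf8(decode_utf16(src, len, &pos), tmp);
        if (n > cap - out) {
            return (size_t)-1; /* output buffer too small */
        }
        memcpy(dst + out, tmp, n);
        out += n;
    }
    return out;
}
```

Calling `utf16_to_utf8` on the three units 0x0041, 0xD800, 0x0042 writes the bytes 41 EF BF BD 42: the lone high surrogate becomes the three-byte UTF-8 form of U+FFFD, and the conversion carries on instead of failing the whole buffer, which is the behaviour the commit message describes.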