couchdb-dev mailing list archives

From "Nuutti Kotivuori (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COUCHDB-1176) CouchDB accepts data which it cannot replicate (invalid UTF-8 json during replication)
Date Tue, 24 May 2011 10:18:47 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038487#comment-13038487 ]

Nuutti Kotivuori commented on COUCHDB-1176:
-------------------------------------------

The bug is in mochijson2.erl, where tokenize_string_fast (which is hand-written) allows
invalid UTF-8, whereas tokenize_string uses xmerl_ucs:to_utf8 to convert escapes to UTF-8.
The following is taken directly from the xmerl documentation:

%%% UTF-8 support
%%% Possible errors encoding UTF-8:
%%%	- Non-character values (something other than 0 .. 2^31-1).
%%%	- Surrogate pair code in string.
%%%	- 16#FFFE or 16#FFFF character in string.

Either the same values should be rejected by tokenize_string_fast, or both places should accept
the values.
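
To make the inconsistency concrete, here is a minimal Python sketch (not the mochijson2 code) that
applies one check to both input forms. The rejection rule mirrors only the three cases listed in the
xmerl comment above; the names is_rejected_code_point and check_string are hypothetical, and the
"abc" field is borrowed from the error message in the issue description.

```python
import json


def is_rejected_code_point(cp: int) -> bool:
    """Return True for the cases listed in the xmerl comment (assumed rule)."""
    if cp < 0 or cp > 2**31 - 1:       # outside 0 .. 2^31-1
        return True
    if 0xD800 <= cp <= 0xDFFF:         # surrogate pair code
        return True
    return cp in (0xFFFE, 0xFFFF)      # 16#FFFE or 16#FFFF


def check_string(s: str) -> None:
    """Raise ValueError if any code point in s falls under the rule above."""
    for ch in s:
        if is_rejected_code_point(ord(ch)):
            raise ValueError("rejected code point U+%04X" % ord(ch))


# The same document, once with the escape and once with raw UTF-8 bytes.
escaped_body = b'{"abc":"\\ufffe"}'
raw_body = b'{"abc":"\xef\xbf\xbe"}'

doc_from_escape = json.loads(escaped_body.decode("utf-8"))
doc_from_raw = json.loads(raw_body.decode("utf-8"))

# Both forms decode to the same string, so one check covers both paths.
assert doc_from_escape == doc_from_raw
```

Calling check_string on the "abc" value of either document raises the same ValueError, which is
the "reject in both places" option; dropping the check entirely corresponds to the "accept in both
places" option.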

> CouchDB accepts data which it cannot replicate (invalid UTF-8 json during replication)
> --------------------------------------------------------------------------------------
>
>                 Key: COUCHDB-1176
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1176
>             Project: CouchDB
>          Issue Type: Bug
>    Affects Versions: 1.0.1, 1.0.2
>         Environment: CentOS 5.5 64bit
>            Reporter: Jaakko Sipari
>            Priority: Critical
>         Attachments: fffe_escaped.json, fffe_utf8.json
>
>
> CouchDB appears to treat some Unicode characters as illegal when parsing escaped Unicode
> values (\uXXXX) during insert or update of a document. These characters can, however, be inserted
> into the database by using the UTF-8 encoding instead of escaping. An example is
> the Unicode value 0xFFFE, which is escaped as \uFFFE and is represented in UTF-8 by the consecutive
> bytes 0xEF, 0xBF and 0xBE.
> Even though the documents are inserted using UTF-8 encoding without errors, CouchDB always
> serves them in the escaped form. This leads to the actual problem we currently have: if
> documents containing such unaccepted characters are inserted into CouchDB using UTF-8 encoding,
> an attempt to replicate the database aborts at the first of those documents with an error like
> this:
> {"error":"json_encode","reason":"{bad_term,{nocatch,{invalid_json,<<\"[{\\\"ok\\\":{\\\"_id\\\":\\\"192058c4f81afc66c5bf883548004331\\\",\\\"_rev\\\":\\\"1-ad1c9dcee520d12abdf948d91e31cf15\\\",\\\"abc\\\":\\\"\\\\ufffe\\\",\\\"_revisions\\\":{\\\"start\\\":1,\\\"ids\\\":[\\\"ad1c9dcee520d12abdf948d91e31cf15\\\"]}}}]\\n\">>}}}"}
> Here are steps to reproduce:
> curl -X PUT http://localhost:5984/replicationtest_source
> curl -X PUT http://localhost:5984/replicationtest_target
> # Should fail
> curl -H "Content-Type:application/json" -X POST -d @fffe_escaped.json http://localhost:5984/replicationtest_source
> # Should succeed
> curl -H "Content-Type:application/json" -X POST -d @fffe_utf8.json http://localhost:5984/replicationtest_source
> # Should fail with a json_encode error related to the previously inserted document
> curl -H "Content-Type:application/json" -X POST -d "{\"source\":\"http://localhost:5984/replicationtest_source\",\"target\":\"replicationtest_target\"}" http://localhost:5984/_replicate
> If anyone has a quick fix for this (how to accept "invalid" escaped Unicode characters,
> at least during replication), we would be more than happy to test it.
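
The two attachments are not reproduced in this message. As a rough sketch, files of the same shape
could be generated as follows; the "abc" field name comes from the error message above, while the
exact contents of the real attachments and the helper name write_fixtures are assumptions.

```python
import os


def write_fixtures(directory):
    """Write two one-field documents that differ only in how U+FFFE is spelled.

    Hypothetical stand-ins for fffe_escaped.json and fffe_utf8.json.
    """
    escaped_path = os.path.join(directory, "fffe_escaped.json")
    utf8_path = os.path.join(directory, "fffe_utf8.json")

    # Six ASCII characters: backslash, 'u', 'f', 'f', 'f', 'e'.
    with open(escaped_path, "wb") as f:
        f.write(b'{"abc":"\\ufffe"}\n')

    # The three raw bytes 0xEF 0xBF 0xBE, i.e. U+FFFE encoded as UTF-8.
    with open(utf8_path, "wb") as f:
        f.write(b'{"abc":"\xef\xbf\xbe"}\n')

    return escaped_path, utf8_path
```

Running write_fixtures(".") before the curl commands above would provide the two files named in
the -d @... arguments.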

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
