couchdb-dev mailing list archives

From "Noah Slater (JIRA)" <>
Subject [jira] Commented: (COUCHDB-345) "High ASCII" can be inserted into db but not retrieved
Date Sat, 29 Aug 2009 16:41:32 GMT


Noah Slater commented on COUCHDB-345:

I disagree with Curt.

The JSON RFC is either wrong or carelessly worded. You cannot encode anything as Unicode because
Unicode is not an encoding; it is a collection of code points that have no binary representation.
You can encode these code points into character data, and you can decode the same character
data back into Unicode. Unicode is always some internal representation after decoding and before
encoding. I am guessing everyone already knows this, but I keep seeing people form arguments
(particularly on IRC) that start with "since JSON has to be encoded as Unicode", which is just
a meaningless phrase (and the RFC is to blame, as it uses this wording), and the conclusions
that follow from it have tended to be false.
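
To make the distinction concrete, here is a minimal Python sketch (Python is my choice for illustration; it is not used in the issue). It takes the single code point U+00D8, the character from the bug report, and shows that it only becomes bytes once an encoding is picked, and that different encodings give different bytes:

```python
# U+00D8 is a code point; it has no bytes until an encoding is chosen.
ch = "\u00d8"  # LATIN CAPITAL LETTER O WITH STROKE

utf8 = ch.encode("utf-8")          # two bytes
latin1 = ch.encode("iso-8859-1")   # one byte, 0xD8
utf16 = ch.encode("utf-16-be")     # two bytes, different from the UTF-8 ones

# The three byte strings differ from one another...
assert len({utf8, latin1, utf16}) == 3

# ...but decoding each with its own encoding yields the same code point.
assert utf8.decode("utf-8") == ch
assert latin1.decode("iso-8859-1") == ch
assert utf16.decode("utf-16-be") == ch
```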

In this case, what the JSON RFC should say is that JSON should be encoded from Unicode, which
means that the encoding could be anything from ISO-8859-1 to Shift JIS, which means that we
cannot "unambiguously determine the encoding from the content." Even if we decided to only
allow UTF-8, UTF-16, or UTF-32, we could only "unambiguously determine the encoding" if the
request body included the BOM, which is entirely optional. So again, without the declared
encoding (the charset parameter of the Content-Type header), we are forced to use a heuristic.
Heuristics already exist, and where they are not already available in Erlang, I rather suspect
that they can be ported with relative ease.
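
As a rough illustration of the BOM case only, the following Python sketch checks whether a body starts with one of the BOMs defined in the standard `codecs` module. The function name `sniff_bom` is hypothetical, and this is not one of the existing heuristics referred to above; when no BOM is present it simply returns `None`, which is exactly the situation where a heuristic or a declared encoding is needed.

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM,
# so checking UTF-16 first would misreport UTF-32-LE input.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(body):
    """Return (encoding, bom_length) if body starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name, len(bom)
    return None  # no BOM: the encoding cannot be read off this way
```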

If we can access the declared encoding (the charset parameter of Content-Type), we should
absolutely use it, and absolutely reject as garbage any request that cannot be decoded with
the explicit encoding. Any patch that willfully ignored this information only to fall back
onto a heuristic would get my emphatic veto. I am, however, satisfied with requiring UTF-8 in
the short term and adding awareness of the declared encoding at some later point.
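
A minimal sketch of that policy in Python, under stated assumptions: `decode_body` and its `declared_charset` argument are hypothetical names, the charset is assumed to have been extracted from the request headers already, and UTF-8 is assumed as the short-term default when nothing is declared. It never guesses: a body that does not decode strictly is rejected.

```python
def decode_body(body, declared_charset=None):
    """Decode request bytes strictly; raise ValueError instead of guessing."""
    # Assumed policy: honour the declared charset, otherwise require UTF-8.
    charset = declared_charset or "utf-8"
    try:
        return body.decode(charset, errors="strict")
    except LookupError:
        # The declared charset names no codec known to Python.
        raise ValueError("unknown charset: %r" % charset)
    except UnicodeDecodeError as exc:
        # The bytes are not valid in the explicit encoding: reject the request.
        raise ValueError("body is not valid %s: %s" % (charset, exc))
```

With this sketch, the byte 0xD8 on its own is accepted when the request declares ISO-8859-1 and rejected under the UTF-8 default, since it is not a valid UTF-8 sequence.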

> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>                 Key: COUCHDB-345
>                 URL:
>             Project: CouchDB
>          Issue Type: Bug
>    Affects Versions: 0.9
>         Environment: OSX 10.5.6
>            Reporter: Joan Touzet
>         Attachments: badenc1.patch, badtext.tar.gz,, reject_invalid_utf8.patch
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value that cannot
> be retrieved. This results from not escaping a non-ASCII value into \u#### when PUT/POSTing
> the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø) in a possibly
> unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
>     "ok": true
> }
> {
>     "id": "fail", 
>     "ok": true, 
>     "rev": "1-76726372"
> }
> {
>     "error": "ucs", 
>     "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the bad_utf8_character_code
> exception thrown by a design document attempting to map() the bad document caused Futon to
> fail silently in building the view, with no indication (except via debug log) that there was
> a failure. The log indicated two attempts to build the view, both failing, followed by an
> uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not handle the
> bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have rejected the PUT/POST,
> or should have escaped the input itself before the insertion.
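
For reference, the \u#### escaping mentioned in the quoted report can be sketched in Python as follows. This is not the sample code attached to the issue (which is not reproduced here); the document contents are hypothetical. Python's `json.dumps` escapes non-ASCII characters by default, so the serialized text is pure ASCII and carries no encoding ambiguity:

```python
import json

doc = {"_id": "fail", "value": "\u00d8"}  # hypothetical document containing Ø

escaped = json.dumps(doc)                  # ensure_ascii=True is the default
raw = json.dumps(doc, ensure_ascii=False)  # keeps the literal Ø character

# The escaped form contains only ASCII, with Ø written as a \u escape.
assert "\\u00d8" in escaped
escaped.encode("ascii")  # would raise UnicodeEncodeError if non-ASCII remained

# Both forms parse back to the same document.
assert json.loads(escaped) == json.loads(raw) == doc
```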

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
