couchdb-dev mailing list archives

From "Curt Arnold (JIRA)" <>
Subject [jira] Commented: (COUCHDB-345) "High ASCII" can be inserted into db but not retrieved
Date Sat, 29 Aug 2009 05:56:32 GMT


Curt Arnold commented on COUCHDB-345:

ISO-8859-1, Cp1252 and Latin-1 are near synonyms for an encoding that represents the first 256
Unicode code points as single-byte values and is incapable of representing any other character
without some escape mechanism.  Any arbitrary sequence of bytes is a valid ISO-8859-1 sequence
and can be decoded into a sequence of Unicode characters.
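
A minimal sketch of that property in Python (the language is my choice for illustration; nothing
here comes from the CouchDB code base). It assumes only the standard "iso-8859-1" codec:

```python
# Every byte value 0x00-0xFF is a valid ISO-8859-1 sequence.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# Each byte maps to the Unicode code point with the same numeric value,
# so decoding never fails and encoding restores the original bytes.
code_points = [ord(ch) for ch in decoded]
round_trips = decoded.encode("iso-8859-1") == all_bytes
```

Because the decode can never fail, it also can never signal that the bytes were meant to be
something else.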

UTF-8 is a variable-length encoding of the full Unicode character repertoire.  Characters
from \u0000 to \u007F are represented as a single byte, while other characters require 2-4
bytes to encode (the original design allowed up to 6).  Unlike ISO-8859-1, not every sequence
of bytes is valid and can be converted back to Unicode code points.  If I remember correctly,
a byte with the high bit set is only valid as part of a multi-byte sequence, so a lead byte that
is not followed by the right number of continuation bytes (0x80-0xBF) is invalid.  The test data
in the last two cases are valid ISO-8859-1 sequences, but they cannot be interpreted as UTF-8
since they contain byte sequences that cannot be converted back into Unicode code points.
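
To make the contrast concrete, here is a small Python sketch. The helper name is_valid_utf8 and
the sample bodies are hypothetical stand-ins, not the attached test data; the only assumption is
that the failing documents carried a raw 0xD8 byte between ASCII characters:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical request bodies: the same JSON text in two encodings.
latin1_body = '{"value": "\u00d8"}'.encode("iso-8859-1")  # contains the single byte 0xD8
utf8_body = '{"value": "\u00d8"}'.encode("utf-8")         # contains the pair 0xC3 0x98

latin1_ok = is_valid_utf8(latin1_body)  # 0xD8 is a lead byte followed by an ASCII quote
utf8_ok = is_valid_utf8(utf8_body)
```

The ISO-8859-1 body fails the check because 0xD8 announces a two-byte sequence and the next byte
is not a continuation byte.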

If it were just an encoding mismatch and the data were being misinterpreted, you would lay the
blame on the client.  However, in this case, data can go into the database that the rest of
the stack can't process since it contains invalid sequences.

The RFC mentions the big- and little-endian variants of UTF-16 and UCS-4; however, the
ISO-8859-1 sequences could not be interpreted using any of those encodings since the first two
characters must be ASCII.  There are only certain sequences of bytes that could appear at the
start of JSON encoded in any of those encodings, and the byte sequences sent in the last two
cases don't match any of those patterns.  Sniffing the encoding would work in a similar manner
to XML, which is described in
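
A rough sketch of such sniffing in Python, assuming the RFC in question is RFC 4627, whose
section 3 identifies the encoding from the pattern of zero bytes in the first four octets. The
function name sniff_json_encoding and the fallback for short input are my own assumptions:

```python
def sniff_json_encoding(data: bytes) -> str:
    """Guess a JSON text's encoding from the zero-byte pattern of its first four bytes."""
    if len(data) < 4:
        return "utf-8"  # assumption: too short to sniff, fall back to the default
    zeros = tuple(b == 0 for b in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
        (False, False, False, False): "utf-8",    # xx xx xx xx
    }
    return patterns.get(zeros, "unknown")


# Hypothetical ISO-8859-1 body: no zero bytes, so it can only be treated as UTF-8.
guess = sniff_json_encoding('{"value": "\u00d8"}'.encode("iso-8859-1"))
```

An ISO-8859-1 body never matches the UTF-16 or UTF-32 patterns, so the only remaining candidate
is UTF-8, and a strict UTF-8 decode would then reject it.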

> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>                 Key: COUCHDB-345
>                 URL:
>             Project: CouchDB
>          Issue Type: Bug
>    Affects Versions: 0.9
>         Environment: OSX 10.5.6
>            Reporter: Joan Touzet
>         Attachments: badtext.tar.gz,
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value that cannot
> be retrieved. This results from not escaping a non-ASCII value into \u#### when PUT/POSTing
> the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø) in a possibly
> unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
>     "ok": true
> }
> {
>     "id": "fail", 
>     "ok": true, 
>     "rev": "1-76726372"
> }
> {
>     "error": "ucs", 
>     "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the bad_utf8_character_code
> exception thrown by a design document attempting to map() the bad document caused Futon to
> fail silently in building the view, with no indication (except via debug log) that there was
> a failure. The log indicated two attempts to build the view, both failing, followed by an
> uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not handle the
> bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have rejected the PUT/POST,
> or should have escaped the input itself before the insertion.
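
The two options in the quoted report (reject the request, or escape the input before insertion)
can be sketched together in Python. This is only an illustration of the idea, not CouchDB's
behaviour; the function name normalize_json_body is hypothetical:

```python
import json


def normalize_json_body(body: bytes) -> bytes:
    """Reject bodies that are not valid UTF-8 JSON; otherwise return an ASCII-only copy."""
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Option 1: refuse the PUT/POST instead of storing unreadable bytes.
        raise ValueError("request body is not valid UTF-8") from exc
    doc = json.loads(text)
    # Option 2: escape every non-ASCII character as \uXXXX before storing.
    return json.dumps(doc, ensure_ascii=True).encode("ascii")


escaped = normalize_json_body('{"v": "\u00d8"}'.encode("utf-8"))
```

With either option, a document that gets stored can always be read back by the rest of the
stack.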

