Date: Sun, 30 Aug 2009 05:30:20 +0100
From: Noah Slater
To: dev@couchdb.apache.org
Subject: Re: Character encodings and JSON RFC (spun off from COUCHDB-345)

On Sat, Aug 29, 2009 at 10:51:51PM -0500, Curt Arnold wrote:
> I agree that is unfortunately worded. I checked the IETF RFC errata page
> (http://www.rfc-editor.org/errata_search.php?rfc=4627) and did not find a
> clarification on this issue. I basically interpreted "in Unicode" as "in a
> Unicode Transformation Format" and more specifically as in a UTF recommended
> by the Unicode consortium and in widespread use.

"A string is a sequence of zero or more Unicode characters"

 - http://www.ietf.org/rfc/rfc4627.txt

The only possible meaning is any encoding of any Unicode code point.

"All Unicode characters may be placed within the quotation marks[...]"

 - ibid

Ditto.

"JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

 - ibid

"Encoded in Unicode" is nonsensical, so the only possible parsing is:

"JSON text SHALL be encoded from Unicode. The default encoding is UTF-8."

 - ibid

Which means that any encoding is good.

Those are the only mentions of Unicode in the entire specification. If we
search for "encoding" we get:

"[...] the character is represented as a twelve-character sequence, encoding
the UTF-16 surrogate pair"

 - ibid

"JSON may be represented using UTF-8, UTF-16, or UTF-32. When JSON is written
in UTF-8, JSON is 8bit compatible. When JSON is written in UTF-16 or UTF-32,
the binary content-transfer-encoding must be used."

 - ibid

From this we can conclude:

 * JSON can be encoded in UTF-8, UTF-16, or UTF-32.

 * The editor of RFC 4627 was high.
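(As an aside, that "twelve-character sequence" is at least well defined. A
rough, untested sketch in Erlang of what it means; the module and function
names are mine, not from the RFC or CouchDB:)

    -module(json_escape).
    -export([escape_codepoint/1]).

    %% Sketch only: escape a Unicode code point as RFC 4627 describes.
    %% Code points outside the BMP become two \uXXXX escapes encoding
    %% the UTF-16 surrogate pair.
    escape_codepoint(CP) when CP > 16#FFFF ->
        V  = CP - 16#10000,
        Hi = 16#D800 + (V bsr 10),       %% high (lead) surrogate
        Lo = 16#DC00 + (V band 16#3FF),  %% low (trail) surrogate
        lists:flatten(io_lib:format("\\u~4.16.0B\\u~4.16.0B", [Hi, Lo]));
    escape_codepoint(CP) ->
        lists:flatten(io_lib:format("\\u~4.16.0B", [CP])).

For example, escape_codepoint(16#1D11E) gives "\uD834\uDD1E", which is the
RFC's own G clef example.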
> There is this quote from http://www.json.org/fatfree.html:
>
>> The character encoding of JSON text is always Unicode. UTF-8 is the only
>> encoding that makes sense on the wire, but UTF-16 and UTF-32 are also
>> permitted.
>
> It still uses the troublesome meme "character encoding ... Unicode", however
> it seems to be a stretch to read that and think that Shift-JIS, ISO-8859-8,
> MacLatin, EBCDIC, etc. are also fine and dandy.

The RFC demonstrates conclusively that the only allowable encodings are:
UTF-8, UTF-16, or UTF-32.

> Don Box also seems to have a similar interpretation
> (http://www.pluralsight.com/community/blogs/dbox/archive/2007/01/03/45560.aspx):
>
>> 2. Like Joe Gregorio states in the comments on Tim's post, I also prefer
>> JSON's simplification of only allowing UTF-*-based encoding.
>
> Tim Bray had an interesting comment in
> http://www.tbray.org/ongoing/When/200x/2006/12/21/JSON:
>
>> I look at the Ruby JSON library, for example, and I see all these character
>> encoding conversion routines; blecch.
>> Use JSON · Seems easy to me; if you want to serialize a data structure
>> that's not too text-heavy and all you want is for the receiver to get the
>> same data structure with minimal effort, and you trust the other end to get
>> the i18n right, JSON is hunky-dory.
>
> This hints that things in the field aren't all pristine UTF encodings,
> however.
>
> Probably best to ping the RFC editor to see if there is a clarification.

UTF-8, UTF-16, or UTF-32.

>> In this case, what the JSON RFC should say is that JSON should be encoded
>> from Unicode, which means that the encoding could be anything from
>> ISO-8859-1 to Shift JIS, which means that we cannot "unambiguously
>> determine the encoding from the content." Even if we decided to only allow
>> UTF-8, UTF-16, or UTF-32, we could only "unambiguously determine the
>> encoding" if the request body included the BOM, which is entirely optional.
>> So again, without the Content-Encoding information, we are forced to use a
>> heuristic. Heuristics already exist, and where they are not already
>> available in Erlang, I rather suspect that they can be ported with relative
>> ease.
>
> I haven't worked through all the sequences, but since we know that the first
> character of a JSON text is either "[" or "{", that should be enough to
> unambiguously determine the encoding from the set UTF-8, UTF-16BE, UTF-16LE,
> or UTF-32.

Since the first two characters of a JSON text will always be ASCII characters
[RFC0020], it is possible to determine whether an octet stream is UTF-8,
UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in
the first four octets.

   00 00 00 xx  UTF-32BE
   00 xx 00 xx  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 xx 00  UTF-16LE
   xx xx xx xx  UTF-8

 - ibid

>> If we can access the Content-Encoding, we should absolutely use it, and
>> absolutely reject as garbage any request that could not be decoded with
>> the explicit encoding. Any patch that wilfully ignored this information
>> only to fall back onto a heuristic would get my emphatic veto. I am,
>> however, satisfied with requiring UTF-8 in the short term, and adding
>> Content-Encoding awareness at some later point.
>
> The RFC says "A JSON parser MAY accept non-JSON forms or extensions." So
> unlike an XML processor that is prohibited from assuming ISO-8859-1 if it
> encounters an invalid UTF-8 sequence, a JSON parser could transparently
> assume ISO-8859-1 after encountering a bad UTF-8 sequence. Whether that
> would be a good thing is debatable.

ISO-8859-1 JSON is invalid JSON.
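Incidentally, that null-pattern test is small enough to sketch. Something
like this (untested, names mine) should do it in Erlang:

    -module(json_detect).
    -export([detect_encoding/1]).

    %% Sketch only: guess the UTF flavour of a JSON octet stream from
    %% the pattern of NULs in its first four octets, per the RFC 4627
    %% table quoted above.
    detect_encoding(<<B1, B2, B3, B4, _/binary>>) ->
        case {B1 =:= 0, B2 =:= 0, B3 =:= 0, B4 =:= 0} of
            {true,  true,  true,  false} -> 'UTF-32BE';
            {true,  false, true,  false} -> 'UTF-16BE';
            {false, true,  true,  true } -> 'UTF-32LE';
            {false, true,  false, true } -> 'UTF-16LE';
            _                            -> 'UTF-8'
        end;
    detect_encoding(_Shorter) ->
        %% A JSON text is at least two characters, so fewer than four
        %% octets can only be UTF-8.
        'UTF-8'.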
> There are a couple of questions that could be addressed:
>
> 1. How to treat a JSON request entity that does not contain a
> Content-Encoding header. Particularly when the entity is not consistent
> with the expected encoding.

Considering we can reliably determine the encoding, we can:

 * Ignore the Content-Encoding header completely.

 * Reject any request where the Content-Encoding header is wrong.

The jerk in me wants to opt for the second option, but there is always
Postel's law.

> 2. How to treat a JSON request with a specified Content-Encoding. What
> encodings would be supported? What would CouchDB do for an unsupported
> encoding? What would occur if the entity was not consistent with the
> encoding?

Ignore it or barf. See above dilemma of jerk vs. hippie approach.

UTF-8, UTF-16, or UTF-32.

Barf.

Ignore it or barf. As above.

> 3. What should CouchDB send when there is no "Accept-Charset" in the
> request?

UTF-8.

> 4. What should CouchDB send when there is an "Accept-Charset" in the
> request? Particularly if the request does not contain a UTF.

406 Not Acceptable.

> I think the current answers are:
>
> 1. Entity is interpreted as UTF-8. Currently, if the encoding is
> inconsistent, it is still committed to the database and bad things happen
> later. If a fix for COUCHDB-345 is committed, then CouchDB would reject
> the request with a 400.

+1

> 2. Same as 1, Content-Encoding is not considered.

Undecided.

> 3. CouchDB always sends UTF-8.

+1

> 4. Same as 3, Accept-Charset is not considered.

-1

Come on, you gotta give me this. It's fun to send back 406! Stupid clients.

> It is not a pressing issue for me, and since COUCHDB-345 languished for
> such a long time, I'm not thinking that many people are trying to push
> other encodings into the database, with the exception of people pushing
> ISO-8859-1 up but not getting burned since their content hasn't yet
> contained non-ASCII characters.

I apologise for not reading the (totally bonkers) RFC properly until now.

Did I mention it's totally bonkers?

Best,

--
Noah Slater, http://tumbolia.org/nslater
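P.S. For the archives, a rough and untested sketch of my answers to 3 and 4;
the names and the (elided) header parsing are mine, not CouchDB's:

    -module(charset_pick).
    -export([pick/1]).

    %% No Accept-Charset header: just send UTF-8 (question 3).
    pick(undefined) ->
        {ok, "UTF-8"};
    %% Accept-Charset present: accept any UTF-* entry, else 406
    %% (question 4). Simplified: a real version would honour q-values
    %% and pick among UTF-8, UTF-16 and UTF-32.
    pick(Charsets) when is_list(Charsets) ->
        IsUtf = fun(C) -> lists:prefix("utf-", string:to_lower(C)) end,
        case lists:any(IsUtf, Charsets) of
            true  -> {ok, "UTF-8"};
            false -> {error, not_acceptable}  %% HTTP 406
        end.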