couchdb-dev mailing list archives

From Curt Arnold <carn...@apache.org>
Subject Character encodings and JSON RFC (spun off from COUCHDB-345)
Date Sun, 30 Aug 2009 03:51:51 GMT
I'm spinning a discussion that was occurring in COUCHDB-345 (http://issues.apache.org/jira/browse/COUCHDB-345) off to the mailing list, since it was growing beyond the immediate issue reported.  The reported problem was that CouchDB would accept PUT requests without checking that the content contained valid UTF-8 encoded data, which would result in documents that could not be retrieved, would disrupt view generation, and could have other adverse side effects.
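
For concreteness, here is a minimal sketch (my own, not the actual CouchDB fix; the function name is made up) of the kind of well-formedness check such a fix would add before a body is committed, using the standard unicode module:

    %% Returns ok when Body is well-formed UTF-8, {error, invalid_utf8}
    %% otherwise.  unicode:characters_to_binary/3 returns an error or
    %% incomplete tuple when the octets cannot be decoded as UTF-8.
    validate_utf8(Body) when is_binary(Body) ->
        case unicode:characters_to_binary(Body, utf8, utf8) of
            B when is_binary(B) -> ok;
            _                   -> {error, invalid_utf8}
        end.

A PUT handler could then answer 400 Bad Request instead of committing the document.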

Noah Slater added a comment - 29/Aug/09 09:40 AM
> I disagree with Curt.
>
> The JSON RFC is either wrong or carelessly worded. You cannot encode  
> anything as Unicode because Unicode is not an encoding, it is a  
> collection of code points that have no binary representation. You  
> can encode these code points into character data, and you can decode  
> the same character data into Unicode. Unicode is always some  
> internal representation after decoding, and before encoding. I am  
> guessing everyone already knows this, but I keep seeing people form  
> arguments (particularly on IRC) that start with "since JSON has to  
> be encoded as Unicode" which is just a meaningless sentence (and the  
> RFC is to blame as it uses this wording) and hence conclusions that  
> follow from that have tended to be false.


I agree that it is unfortunately worded.  I checked the IETF RFC errata page (http://www.rfc-editor.org/errata_search.php?rfc=4627) and did not find a clarification on this issue.  I basically interpreted "in Unicode" as "in a Unicode Transformation Format", and more specifically as a UTF recommended by the Unicode Consortium and in widespread use.


There is this quote from http://www.json.org/fatfree.html:
>
> The character encoding of JSON text is always Unicode. UTF-8 is the  
> only encoding that makes sense on the wire, but UTF-16 and UTF-32  
> are also permitted.

It still uses the troublesome meme "character encoding ... Unicode", but it seems a stretch to read that and conclude that Shift-JIS, ISO-8859-8, MacLatin, EBCDIC, etc. are also fine and dandy.

Don Box also seems to have a similar interpretation (http://www.pluralsight.com/community/blogs/dbox/archive/2007/01/03/45560.aspx):
>
> 2. Like Joe Gregorio states in the comments on Tim's post, I also  
> prefer JSON's simplification of only allowing UTF-*-based encoding.

Tim Bray in http://www.tbray.org/ongoing/When/200x/2006/12/21/JSON had an interesting comment:
>
> I look at the Ruby JSON library, for example, and I see all these  
> character encoding conversion routines; blecch.
> Use JSON · Seems easy to me; if you want to serialize a data  
> structure that’s not too text-heavy and all you want is for the  
> receiver to get the same data structure with minimal effort, and you  
> trust the other end to get the i18n right, JSON is hunky-dory.

This hints, however, that things in the field aren't all pristine UTF encodings.

Probably best to ping the RFC editor to see if there is a clarification.

>
> In this case, what the JSON RFC should say is that JSON should be  
> encoded from Unicode, which means that the encoding could be  
> anything from ISO-8859-1 to Shift JIS, which means that we cannot  
> "unambiguously determine the encoding from the content." Even if we  
> decided to only allow UTF-8, UTF-16, or UTF-32, we could only  
> "unambiguously determine the encoding" if the request body included  
> the BOM, which is entirely optional. So again, without the
> Content-Encoding information, we are forced to use a heuristic. Heuristics
> already exist, and where they are not already available in Erlang, I  
> rather suspect that they can be ported with relative ease.

I haven't worked through all the sequences, but since we know that the first character of a JSON text is either "[" or "{", that should be enough to unambiguously determine which of UTF-8, UTF-16BE, UTF-16LE, or UTF-32 is the only possible encoding.
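
For illustration, a rough sketch of that check (mine, following section 3 of the RFC, not anything currently in CouchDB): because those first characters are ASCII, the pattern of NUL octets in the first four octets identifies the UTF.

    %% Guess the UTF of a JSON text from the pattern of NUL octets in its
    %% first four octets (RFC 4627, section 3).  Assumes the text starts
    %% with two ASCII characters such as "{" or "[".
    detect_utf(<<0, 0, 0, _/binary>>)    -> 'utf-32be';
    detect_utf(<<0, _, 0, _, _/binary>>) -> 'utf-16be';
    detect_utf(<<_, 0, 0, 0, _/binary>>) -> 'utf-32le';
    detect_utf(<<_, 0, _, 0, _/binary>>) -> 'utf-16le';
    detect_utf(_)                        -> 'utf-8'.

Anything that matches none of the NUL patterns falls through to UTF-8 here.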


>
> If we can access the Content-Encoding, we should absolutely use it,  
> and absolutely reject as garbage any request that could not be  
> decoded with the explicit encoding. Any patch that will-fully  
> ignored this information only to fall back onto a heuristic would  
> get my emphatic veto. I am however satisfied with requiring UTF-8 in  
> the short term, and adding Content-Encoding awareness at some later  
> point.

The RFC says "A JSON parser MAY accept non-JSON forms or extensions."  So unlike an XML processor, which is prohibited from assuming ISO-8859-1 when it encounters an invalid UTF-8 sequence, a JSON parser could transparently assume ISO-8859-1 after encountering a bad UTF-8 sequence.  Whether that would be a good thing is debatable.
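
For example, a deliberately lenient parser could fall back like this (a hypothetical sketch, not something I'm proposing CouchDB do):

    %% Try to decode Body as UTF-8; when that fails, reinterpret the
    %% octets as ISO-8859-1 (Latin-1) and transcode them to UTF-8.
    %% Latin-1 decoding never fails, so every input yields some UTF-8.
    to_utf8(Body) when is_binary(Body) ->
        case unicode:characters_to_binary(Body, utf8, utf8) of
            B when is_binary(B) -> B;
            _ -> unicode:characters_to_binary(Body, latin1, utf8)
        end.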

There are a couple of questions that could be addressed:

1. How to treat a JSON request entity that does not contain a Content-Encoding header, particularly when the entity is not consistent with the expected encoding.

2. How to treat a JSON request with a specified Content-Encoding.   
What encodings would be supported?  What would CouchDB do for an  
unsupported encoding?  What would occur if the entity was not  
consistent with the encoding?

3. What should CouchDB send when there is no "Accept-Charset" in the request?

4. What should CouchDB send when there is an "Accept-Charset" in the request, particularly if the header does not list a UTF?

I think the current answers are:

1. Entity is interpreted as UTF-8.  Currently if the encoding is  
inconsistent, it is still committed to the database and bad things  
happen later.  If a fix for COUCHDB-345 is committed, then CouchDB  
would reject the request with a 400.

2. Same as 1, Content-Encoding is not considered.

3. CouchDB always sends UTF-8.

4. Same as 3, Accept-Charset is not considered.


It is not a pressing issue for me, and since COUCHDB-345 languished for such a long time, I don't think many people are trying to push other encodings into the database, with the exception of people pushing ISO-8859-1 who haven't been burned yet because their content hasn't contained non-ASCII characters.

