Date: Sun, 30 Aug 2009 05:30:20 +0100
From: Noah Slater
To: dev@couchdb.apache.org
Subject: Re: Character encodings and JSON RFC (spun off from COUCHDB-345)

On Sat, Aug 29, 2009 at 10:51:51PM -0500, Curt Arnold wrote:
> I agree that is unfortunately worded. I checked the IETF RFC errata page
> (http://www.rfc-editor.org/errata_search.php?rfc=4627) and did not find a
> clarification on this issue. I basically interpreted "in Unicode" as "in a
> Unicode Transformation Format" and more specifically as in a UTF recommended
> by the Unicode consortium and in widespread use.

"A string is a sequence of zero or more Unicode characters"

 - http://www.ietf.org/rfc/rfc4627.txt

The only possible meaning is any encoding of any Unicode code point.

"All Unicode characters may be placed within the quotation marks[...]"

 - ibid

Ditto.

"JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

 - ibid

"Encoded in Unicode" is nonsensical, so the only possible parsing is:

"JSON text SHALL be encoded from Unicode. The default encoding is UTF-8."

 - ibid

Which means that any encoding is good.

Those are the only mentions of Unicode in the entire specification. If we
search for "encoding" we get:

"[...] the character is represented as a twelve-character sequence, encoding
the UTF-16 surrogate pair"

 - ibid

"JSON may be represented using UTF-8, UTF-16, or UTF-32. When JSON is written
in UTF-8, JSON is 8bit compatible. When JSON is written in UTF-16 or UTF-32,
the binary content-transfer-encoding must be used."

 - ibid

From this we can conclude:

 * JSON can be encoded in UTF-8, UTF-16, or UTF-32.

 * The editor of RFC 4627 was high.
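(As an aside, that "twelve-character sequence" is at least well defined. A
rough, untested sketch in Erlang of what it means; the module and function
names are mine, not from the RFC or CouchDB:)

    -module(json_escape).
    -export([escape_codepoint/1]).

    %% Sketch only: escape a Unicode code point as RFC 4627 describes.
    %% Code points outside the BMP become two \uXXXX escapes encoding
    %% the UTF-16 surrogate pair.
    escape_codepoint(CP) when CP > 16#FFFF ->
        V  = CP - 16#10000,
        Hi = 16#D800 + (V bsr 10),       %% high (lead) surrogate
        Lo = 16#DC00 + (V band 16#3FF),  %% low (trail) surrogate
        lists:flatten(io_lib:format("\\u~4.16.0B\\u~4.16.0B", [Hi, Lo]));
    escape_codepoint(CP) ->
        lists:flatten(io_lib:format("\\u~4.16.0B", [CP])).

For example, escape_codepoint(16#1D11E) gives "\uD834\uDD1E", which is the
RFC's own G clef example.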
> There is this quote from http://www.json.org/fatfree.html:
>
>> The character encoding of JSON text is always Unicode. UTF-8 is the only
>> encoding that makes sense on the wire, but UTF-16 and UTF-32 are also
>> permitted.
>
> It still uses the troublesome meme "character encoding ... Unicode", however
> it seems to be a stretch to read that and think that Shift-JIS, ISO-8859-8,
> MacLatin, EBCDIC, etc. are also fine and dandy.

The RFC demonstrates conclusively that the only allowable encodings are:
UTF-8, UTF-16, or UTF-32.

> Don Box also seems to have a similar interpretation
> (http://www.pluralsight.com/community/blogs/dbox/archive/2007/01/03/45560.aspx):
>
>> 2. Like Joe Gregorio states in the comments on Tim's post, I also prefer
>> JSON's simplification of only allowing UTF-*-based encoding.
>
> Tim Bray had an interesting comment in
> http://www.tbray.org/ongoing/When/200x/2006/12/21/JSON:
>
>> I look at the Ruby JSON library, for example, and I see all these character
>> encoding conversion routines; blecch.
>> Use JSON · Seems easy to me; if you want to serialize a data structure
>> that's not too text-heavy and all you want is for the receiver to get the
>> same data structure with minimal effort, and you trust the other end to get
>> the i18n right, JSON is hunky-dory.
>
> This hints that things in the field aren't all pristine UTF encodings,
> however.
>
> Probably best to ping the RFC editor to see if there is a clarification.

UTF-8, UTF-16, or UTF-32.

>> In this case, what the JSON RFC should say is that JSON should be encoded
>> from Unicode, which means that the encoding could be anything from
>> ISO-8859-1 to Shift JIS, which means that we cannot "unambiguously
>> determine the encoding from the content." Even if we decided to only allow
>> UTF-8, UTF-16, or UTF-32, we could only "unambiguously determine the
>> encoding" if the request body included the BOM, which is entirely optional.
>> So again, without the Content-Encoding information, we are forced to use a
>> heuristic. Heuristics already exist, and where they are not already
>> available in Erlang, I rather suspect that they can be ported with relative
>> ease.
>
> I haven't worked through all the sequences, but since we know that the first
> character of a JSON text is either "[" or "{", that should be enough to
> unambiguously determine the encoding from the set UTF-8, UTF-16BE, UTF-16LE,
> or UTF-32.

Since the first two characters of a JSON text will always be ASCII characters
[RFC0020], it is possible to determine whether an octet stream is UTF-8,
UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in
the first four octets.

   00 00 00 xx  UTF-32BE
   00 xx 00 xx  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 xx 00  UTF-16LE
   xx xx xx xx  UTF-8

 - ibid

>> If we can access the Content-Encoding, we should absolutely use it, and
>> absolutely reject as garbage any request that could not be decoded with
>> the explicit encoding. Any patch that wilfully ignored this information
>> only to fall back onto a heuristic would get my emphatic veto. I am,
>> however, satisfied with requiring UTF-8 in the short term, and adding
>> Content-Encoding awareness at some later point.
>
> The RFC says "A JSON parser MAY accept non-JSON forms or extensions." So
> unlike an XML processor that is prohibited from assuming ISO-8859-1 if it
> encounters an invalid UTF-8 sequence, a JSON parser could transparently
> assume ISO-8859-1 after encountering a bad UTF-8 sequence. Whether that
> would be a good thing is debatable.

ISO-8859-1 JSON is invalid JSON.
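Incidentally, that null-pattern test is small enough to sketch. Something
like this (untested, names mine) should do it in Erlang:

    -module(json_detect).
    -export([detect_encoding/1]).

    %% Sketch only: guess the UTF flavour of a JSON octet stream from
    %% the pattern of NULs in its first four octets, per the RFC 4627
    %% table quoted above.
    detect_encoding(<<B1, B2, B3, B4, _/binary>>) ->
        case {B1 =:= 0, B2 =:= 0, B3 =:= 0, B4 =:= 0} of
            {true,  true,  true,  false} -> 'UTF-32BE';
            {true,  false, true,  false} -> 'UTF-16BE';
            {false, true,  true,  true } -> 'UTF-32LE';
            {false, true,  false, true } -> 'UTF-16LE';
            _                            -> 'UTF-8'
        end;
    detect_encoding(_Shorter) ->
        %% A JSON text is at least two characters, so fewer than four
        %% octets can only be UTF-8.
        'UTF-8'.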
> There are a couple of questions that could be addressed:
>
> 1. How to treat a JSON request entity that does not contain a
> Content-Encoding header. Particularly when the entity is not consistent
> with the expected encoding.

Considering we can reliably determine the encoding, we can:

 * Ignore the Content-Encoding header completely.

 * Reject any request where the Content-Encoding header is wrong.

The jerk in me wants to opt for the second option, but there is always
Postel's law.

> 2. How to treat a JSON request with a specified Content-Encoding. What
> encodings would be supported? What would CouchDB do for an unsupported
> encoding? What would occur if the entity was not consistent with the
> encoding?

Ignore it or barf. See above dilemma of jerk vs. hippie approach.

UTF-8, UTF-16, or UTF-32.

Barf.

Ignore it or barf. As above.

> 3. What should CouchDB send when there is no "Accept-Charset" in the
> request?

UTF-8.

> 4. What should CouchDB send when there is an "Accept-Charset" in the
> request? Particularly if the request does not contain a UTF.

406 Not Acceptable.

> I think the current answers are:
>
> 1. Entity is interpreted as UTF-8. Currently, if the encoding is
> inconsistent, it is still committed to the database and bad things happen
> later. If a fix for COUCHDB-345 is committed, then CouchDB would reject
> the request with a 400.

+1

> 2. Same as 1, Content-Encoding is not considered.

Undecided.

> 3. CouchDB always sends UTF-8.

+1

> 4. Same as 3, Accept-Charset is not considered.

-1

Come on, you gotta give me this. It's fun to send back 406! Stupid clients.

> It is not a pressing issue for me, and since COUCHDB-345 languished for
> such a long time, I'm not thinking that many people are trying to push
> other encodings into the database, with the exception of people pushing
> ISO-8859-1 up but not getting burned since their content hasn't yet
> contained non-ASCII characters.

I apologise for not reading the (totally bonkers) RFC properly until now.

Did I mention it's totally bonkers?

Best,

--
Noah Slater, http://tumbolia.org/nslater
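P.S. For the archives, a rough and untested sketch of my answers to 3 and 4;
the names and the (elided) header parsing are mine, not CouchDB's:

    -module(charset_pick).
    -export([pick/1]).

    %% No Accept-Charset header: just send UTF-8 (question 3).
    pick(undefined) ->
        {ok, "UTF-8"};
    %% Accept-Charset present: accept any UTF-* entry, else 406
    %% (question 4). Simplified: a real version would honour q-values
    %% and pick among UTF-8, UTF-16 and UTF-32.
    pick(Charsets) when is_list(Charsets) ->
        IsUtf = fun(C) -> lists:prefix("utf-", string:to_lower(C)) end,
        case lists:any(IsUtf, Charsets) of
            true  -> {ok, "UTF-8"};
            false -> {error, not_acceptable}  %% HTTP 406
        end.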