couchdb-dev mailing list archives

From Noah Slater
Subject Re: Character encodings and JSON RFC (spun off from COUCHDB-345)
Date Sun, 30 Aug 2009 04:56:45 GMT
On Sun, Aug 30, 2009 at 12:33:19AM -0400, Paul Davis wrote:
> Can UTF-8 represent all possible unicode encodings?

I could reinterpret this nonsense as:

  Can UTF-8 represent all possible Unicode code points?

To which I would have replied "yes."

> I'm gonna assume yes for the rest of this post.


> I think we're in a bit of a weird spot here cause we're playing with
> the head butt of two different RFC's. The HTTP transport RFC that
> deals with Content-Encoding and charset awesomeness and the JSON RFC
> that is so full of ambiguity I'd like to kick it. On the plus side,
> there's so much ambiguity here that we can basically do whatever we
> want and no one can accuse us of being wrong.

Actually, JSON spells out what we MUST do here:

  JSON may be represented using UTF-8, UTF-16, or UTF-32.  When JSON
  is written in UTF-8, JSON is 8bit compatible.  When JSON is
  written in UTF-16 or UTF-32, the binary content-transfer-encoding
  must be used.

To be honest though, I think that using a content-transfer-encoding header is
further proof that the RFC editor was high. I mean, seriously, this doesn't even
make any sense. JSON can be UTF-8, UTF-16, or UTF-32 in a non-network context,
but if you send it over the wire, it's suddenly a transfer-encoding? It almost
sounds like they were playing technical jargon bingo.

Unicode! Transfer Encoding! Do I win a prize?

> That said, I think we should isolate concerns. Unless someone wants
> to write a JSON parser that understands multiple character encodings
> and doesn't suck ass performance wise, we should probably just assume
> the JSON parser is UTF-8 only.

Any conformant JSON parser should accept UTF-8, UTF-16, and UTF-32. I mean, it's
right there in the spec as a requirement, along with rocking horse people,
eating marshmallow pies.

> Before anyone goes hollering about that, we still have the HTTP layer
> to play with in terms of accepting content encoding. And nothing in
> the HTTP layer says we have to accept UC-4 or NR-17 or whatever. So
> while we're more than welcome to reject any request bodies way before
> they hit the JSON serializer, Noah would probably cut my throat for
> suggesting we don't play nice. Either way, this big conversation on
> character encodings should probably focus on how we move things to
> UTF-8 which I officially nominate as the already de-facto CouchDB
> character encoding.

If we support JSON, we either:

  * Support UTF-8, UTF-16, and UTF-32 per the insane spec.

  * Willfully ignore the spec and require UTF-8.

I vote for the first option.

> Assume UTF-8. If fail, maybe try guessing. If fail, throw a meatball
> at the client saying rejected. We already ignore quite a few headers
> and do things "Non-RESTful-ly" so I'm not too concerned.

No need to guess. The RFC has a mechanism for determining the encoding. It also
specifies a way to indicate the encoding. Which is totally not redundant, in any
way. Thankfully, it doesn't tell us how to handle conflicting information in
this respect, because of course, that would be too easy. Software is no fun when
the specifications make things obvious.
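
A rough sketch of that detection mechanism, for reference. RFC 4627 section 3
notes that the first two characters of a JSON text are always ASCII, so the
pattern of zero octets in the first four bytes identifies the encoding. The
function name and the fallback to UTF-8 for inputs shorter than four bytes are
assumptions made for illustration, not anything CouchDB does today.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from its first four octets.

    Patterns from RFC 4627 section 3 (xx = non-zero octet):
      00 00 00 xx  UTF-32BE
      00 xx 00 xx  UTF-16BE
      xx 00 00 00  UTF-32LE
      xx 00 xx 00  UTF-16LE
      xx xx xx xx  UTF-8
    """
    if len(data) < 4:
        # Assumption: too short to apply the pattern, so treat it as UTF-8.
        return "utf-8"
    # True where the octet is zero, for each of the first four octets.
    zeros = tuple(octet == 0 for octet in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",
        (True, False, True, False): "utf-16-be",
        (False, True, True, True): "utf-32-le",
        (False, True, False, True): "utf-16-le",
    }
    # Anything that matches no zero-octet pattern is taken to be UTF-8.
    return patterns.get(zeros, "utf-8")
```

Calling `detect_json_encoding(body)` on a raw request body returns a Python
codec name that can be passed straight to `bytes.decode`.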

> > 2. How to treat a JSON request with a specified Content-Encoding.
> If the encoding is understood, transcode to a UTF-8 representation.

If it is explicit, and it is wrong, vomit in their face.
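
One way the "transcode if understood, reject if wrong" behaviour could look,
as a minimal sketch. The exception classes, the `SUPPORTED` set, and the
`to_utf8` name are all hypothetical, and the declared charset is assumed to
have been pulled out of the request headers already.

```python
import codecs


class UnsupportedCharset(Exception):
    """The client declared a charset we do not handle (hypothetical)."""


class MalformedBody(Exception):
    """The body does not match the charset the client declared (hypothetical)."""


# Assumption: only the encodings named in the JSON RFC are accepted.
SUPPORTED = {
    "utf-8",
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def to_utf8(body: bytes, declared: str) -> bytes:
    """Transcode a request body from its declared charset to UTF-8."""
    try:
        # Normalise aliases such as "UTF8" to Python's canonical codec name.
        name = codecs.lookup(declared).name
    except LookupError:
        raise UnsupportedCharset(declared) from None
    if name not in SUPPORTED:
        raise UnsupportedCharset(declared)
    try:
        # Strict decoding: any byte sequence invalid in the declared
        # charset is an error rather than something to guess around.
        text = body.decode(name)
    except UnicodeDecodeError as exc:
        raise MalformedBody(f"body is not valid {name}") from exc
    return text.encode("utf-8")
```

An HTTP layer could map `UnsupportedCharset` and `MalformedBody` to whatever
error responses it likes; which status codes to use is left open here.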

> > What encodings would be supported?
> Patches welcome. UTF-8 currently kinda sorta supported.

UTF-8, UTF-16, and UTF-32

> > What would CouchDB do for an unsupported encoding?
> Tell the client that we don't support their weirdo character encoding
> and that patches are welcome at the CouchDB JIRA page that no one
> likes visiting cause Java is the devil. Maybe we don't mention that
> last bit though?


> > What would occur if the entity was not consistent with the encoding?
> If a client goes out of their way to specify a Content-Encoding and
> they send shit that doesn't comply then we should throw a huge pie at
> them and drop the connection. I'm thinking of a Nelson "Ha, ha!" and
> pointing of many fingers.

I would prefer vomiting, but any kind of humiliation works for me.

> > 3. What should CouchDB send when there is no "Accept-Charset" in the
> > request.
> UTF-8. Cause its yummy.

Yum yum for my tum.

> > 4. What should CouchDB send where there is an "Accept-Charset" in the
> > request.  Particularly if the request does not contain a UTF.
> >
> If we understand it, transcode UTF-8 to the requested charset.
> Otherwise, say "Can't do it!".



Noah Slater,
