couchdb-dev mailing list archives

From Brian Candler
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Tue, 23 Jun 2009 08:26:23 GMT
On Sun, Jun 21, 2009 at 11:21:00PM -0700, Chris Anderson wrote:
> A normal
> user is not going to understand the first bit of the fact that the
> underlying binary representation of their text could be subtly
> different in a way that would be invisible to them.
> Secondly, we're a database, so I find highly suspicious the notion
> that we should auto-normalize user input on-the-quiet.

Then maybe it is worth going the whole hog, and just storing the received
JSON directly to disk as a string? This takes out the JSON->erlang parsing
when storing documents, and the erlang->JSON serialisation when sending on
to the view server, or when retrieving documents for the client.

Of course, there is metadata which CouchDB adds, like _id, _rev etc. This
could be stored separately alongside the document, and then shoehorned in
when you retrieve the document (e.g. as simple as inserting some text after
the initial '{').
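To make the "insert some text after the initial '{'" idea concrete, here is a
rough sketch in Python. It is illustrative only and not how CouchDB works
today: the function name splice_metadata is made up, and it assumes the stored
body is a JSON object that does not already contain _id or _rev members.

```python
import json


def splice_metadata(raw_doc: str, meta: dict) -> str:
    """Insert metadata members right after the opening '{' of a stored JSON object.

    Hypothetical helper: assumes raw_doc is a JSON object whose body does not
    already carry the metadata keys (otherwise the result has duplicate members).
    """
    body = raw_doc.lstrip()
    if not body.startswith("{"):
        raise ValueError("stored document is not a JSON object")
    if not meta:
        return body
    rest = body[1:]
    # Serialise the metadata, then drop its enclosing braces to get bare members.
    meta_text = json.dumps(meta)[1:-1]
    # An empty stored object needs no separating comma.
    sep = "" if rest.lstrip().startswith("}") else ","
    return "{" + meta_text + sep + rest


stored = '{"name":"caf\u00e9", "n": 1}'
served = splice_metadata(stored, {"_id": "doc1", "_rev": "1-abc"})
```

Everything after the spliced metadata is the client's original text, byte for
byte, so the document body is never parsed or re-serialised on the way out.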

This gives some interesting future options: e.g. moving the metadata into
HTTP headers, at which point there is no requirement for the document to be
in JSON form at all. It just has to be in some format that the view server
is happy to parse.

As an aside: I agree that subtly different encodings of the "same" document
(i.e. ones that are identical only after NFC normalisation) should have
different revs, because (a) it's unlikely that multiple different client
implementations will be making the same changes to the same documents (i.e.
the clients in a cluster are likely to be homogeneous), and (b) such conflicts
are easy to resolve anyway.
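For anyone who wants to see the effect, a small Python sketch follows. It
assumes, purely for illustration, that a rev is derived from a digest of the
stored bytes; content_digest is a made-up helper and not CouchDB's actual rev
calculation.

```python
import hashlib
import unicodedata


def content_digest(raw: str) -> str:
    """Hypothetical rev-like digest over the UTF-8 bytes of the stored text."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# The same visible text, once with a precomposed e-acute (NFC form) ...
composed = '{"name":"caf\u00e9"}'
# ... and once with 'e' followed by a combining acute accent (NFD form).
decomposed = unicodedata.normalize("NFD", composed)

same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
different_bytes = composed.encode("utf-8") != decomposed.encode("utf-8")
different_digest = content_digest(composed) != content_digest(decomposed)
```

All three flags come out true: the two documents compare equal once
normalised, but their bytes differ, so any byte-based digest differs too.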

> "don't mutilate strings you didn't edit" so as long as client software
> doesn't go jiggling forms to other random look-alike codepoints
> without asking, any potential trouble is confined to fields actually
> effected by an update.

It's probably not reasonable to impose this requirement. Most client software
will deserialise JSON into some internal form (a Ruby hash, a Python dict,
or whatever), at which point transformations will take place, so turning it
back into JSON may well not give exactly the same serialisation. Ruby 1.8
won't even maintain the member ordering.
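The same thing is easy to show in Python with nothing but the standard json
module and its default settings (the sample document is made up):

```python
import json

# A document as a client might have received it: compact, raw UTF-8 text.
received = '{"b":1,"a":"café"}'

# Deserialise into a dict, change nothing, and serialise again.
doc = json.loads(received)
resent = json.dumps(doc)

same_value = json.loads(resent) == doc
same_text = resent == received
```

Here same_value is true but same_text is false: with its defaults, json.dumps
adds a space after each ':' and ',' and escapes the non-ASCII character as
\u00e9, even though no field was edited.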
