couchdb-dev mailing list archives

From Chris Anderson <jch...@apache.org>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Mon, 22 Jun 2009 22:43:40 GMT
On Mon, Jun 22, 2009 at 1:22 PM, Paul Davis<paul.joseph.davis@gmail.com> wrote:
> On Mon, Jun 22, 2009 at 3:32 PM, Noah Slater<nslater@apache.org> wrote:
>> On Mon, Jun 22, 2009 at 03:15:24PM -0400, Paul Davis wrote:

>
> Exactly, though I would add a third choice that is
>
>  * calculate the document hash from the deterministic binary serialization

I think this is the 90% solution. Unicode normalization may be the
other 10%. I just don't want to see the last 10% block the first 90%.

>
> Which would include requirements like serializing document members
> with some defined ordering.
>
> On a side note, I've also contemplated just hashing the incoming
> binary representation as the new revision. Though that comes with its
> own set of issues obviously.
>

I don't have anything against Unicode normalization; I just don't
think it buys us a *whole* lot. I do think recursive key sorting and
deterministic float handling are pretty crucial, as even the same Ruby
client will order the keys differently on subsequent PUTs. Just
hashing the PUT body would not be sufficient, I think. It'd be more
like a 50% solution.

Here's some code I wrote a while ago that handles floats and key sorting in JS:

http://github.com/jchris/canonical-json/blob/83751a8b650c60a5fcf3ed4ad5337e3dd172b521/test.html

The client should not send the string used for hashing as the document
itself. The hash would be made from a string derived from the
document, and that string would be lossy on floats. In Couch we'd want
to do this in Erlang, and not store any effects of the function except
the hash.
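
To make the derived-string idea concrete, here is a minimal JavaScript
sketch. It is not the code at the link above and not anything in CouchDB.
The names canonicalString and FLOAT_DIGITS, and the choice to round
non-integer numbers to 15 significant digits, are assumptions made only for
illustration. The output is meant to be fed to a hash function and then
thrown away, never stored as the document.

```javascript
// Sketch only: canonicalString and FLOAT_DIGITS are hypothetical names,
// not taken from canonical-json or CouchDB.
const FLOAT_DIGITS = 14; // assumed precision: 15 significant digits

function canonicalString(value) {
  if (value === null || typeof value === "boolean") {
    return String(value);
  }
  if (typeof value === "string") {
    return JSON.stringify(value);
  }
  if (typeof value === "number") {
    if (!Number.isFinite(value)) {
      throw new TypeError("JSON has no representation for " + value);
    }
    // Integers print exactly; other numbers are rounded (lossy on purpose).
    return Number.isSafeInteger(value)
      ? String(value)
      : value.toExponential(FLOAT_DIGITS);
  }
  if (Array.isArray(value)) {
    // Array order is significant, so elements are not reordered.
    return "[" + value.map(canonicalString).join(",") + "]";
  }
  if (typeof value === "object") {
    // Sort keys at every nesting level (default sort: UTF-16 code units).
    const members = Object.keys(value)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + canonicalString(value[key]));
    return "{" + members.join(",") + "}";
  }
  throw new TypeError("not a JSON value: " + typeof value);
}

// Same document, different key order and a float that differs in the last bit.
const first = canonicalString({ b: [1, { d: 0.1 + 0.2, c: null }], a: "x" });
const second = canonicalString({ a: "x", b: [1, { c: null, d: 0.3 }] });
console.log(first === second); // true
```

Because the string rounds floats, it cannot be parsed back into the
original document, which is why only the hash of it would be kept.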

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
