couchdb-dev mailing list archives

From Chris Anderson <>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Fri, 26 Jun 2009 12:17:30 GMT
On Fri, Jun 26, 2009 at 4:21 AM, Damien Katz<> wrote:
> On Jun 25, 2009, at 6:53 PM, Noah Slater wrote:
>> On Thu, Jun 25, 2009 at 05:37:21PM -0400, Damien Katz wrote:
>>> Integrity will be preserved by use of Content-MD5
>> Bike shed: what about the stronger SHA family of hashes?
> Content-MD5 is a standard header; I can find no other headers that do
> integrity hashing.
>>> But it still is specific to the version of CouchDB and its dependencies
>>> (version of Erlang, version of ICU, etc). It will usually be the same across
>>> versions, but is not guaranteed.
>> If we're doing content hashing, why would this matter?
> Because we don't have a formal canonical format, so we aren't even trying.
> We'll be hashing whatever representation we have in-memory, and that could
> change version to version.
>>> Optionally will allow that if 2 clients make byte identical saves for a
>>> document, they will get the same revision, and you don't need to return a
>>> conflict error to the second client to save.
>> Are there any security issues around possible hash collisions?
> No, we aren't checking them later.

This all sounds very sensible to me.

On the security note: the only time I can see a manufactured hash
collision mattering is if you know that 2 nodes will eventually
replicate, and you provide a poison version of a document to one of
them. In that case what would be the behavior?

Pragmatically, it might be worth checking to see if most clients
(Firefox, Ruby, etc.) will save the same document with different binary
representations to 2 different servers.
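
A quick way to see the concern, as a sketch only: the two "clients" below are
hypothetical and are simulated with Python's json module using different key
order and whitespace. The document is logically the same, but the bytes (and
so an MD5 over the bytes) differ.

```python
import hashlib
import json

# Two hypothetical clients serializing the same logical document.
# Client A uses default separators; client B emits keys in another
# order and uses compact separators.
body_a = json.dumps({"title": "hello", "tags": ["a", "b"]}).encode("utf-8")
body_b = json.dumps(
    {"tags": ["a", "b"], "title": "hello"}, separators=(",", ":")
).encode("utf-8")


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the raw request body."""
    return hashlib.md5(data).hexdigest()


# Same document once parsed, different bytes on the wire.
same_document = json.loads(body_a) == json.loads(body_b)
same_bytes = body_a == body_b
same_hash = md5_hex(body_a) == md5_hex(body_b)
```

Under these assumptions same_document is true while same_bytes and same_hash
are false, so the two saves would not end up with the same rev.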

The easiest way to fix this, without going all the way through the
looking glass, would be to recursively sort any eJSON proplists we encounter
in the #doc.body. This way at least the most common case of
should-be-the-same-but-isn't would be protected against, without
adding much weight to the implementation.
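
A minimal sketch of the recursive sort, in Python rather than Erlang, and not
CouchDB's actual code. It assumes JSON objects are modeled as dicts standing in
for eJSON proplists; sort_members and body_digest are hypothetical names. Only
object member order is normalized: array order, number formatting and Unicode
normalization are left alone.

```python
import hashlib
import json


def sort_members(value):
    """Recursively rebuild a parsed JSON value with object keys sorted."""
    if isinstance(value, dict):
        # Stand-in for sorting an eJSON proplist by key.
        return {key: sort_members(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Array order is significant, so only the items are recursed into.
        return [sort_members(item) for item in value]
    return value


def body_digest(body: bytes) -> str:
    """MD5 over a re-serialized body whose object keys have been sorted."""
    doc = sort_members(json.loads(body))
    encoded = json.dumps(doc, separators=(",", ":"), ensure_ascii=False)
    return hashlib.md5(encoded.encode("utf-8")).hexdigest()


first = body_digest(b'{"a":1,"b":{"d":2,"c":3}}')
second = body_digest(b'{ "b": {"c": 3, "d": 2}, "a": 1 }')
```

Here first and second come out equal, while bodies that differ in array order
still hash differently. Anything beyond key order (1 versus 1.0, composed
versus decomposed Unicode) is exactly the slope described next.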

The argument against this is that it's a slippery slope. If we don't
slide down it, it's not bad. If we slide, then we end up in the
canonicalization muck, so let's not slide. If we think we can't help
but slide, then forget I mentioned it, and put the onus on clients to
provide the same binary data when they want the same rev.

Overall the punting is in the right place, Damien. Good call.

>>> I think this is the most pragmatic way to do deterministic revs and
>>> integrity checking. That is, do as little as possible and let others deal
>>> with the problems and implications of canonicalization if they want to do
>>> end to end integrity checking.
>> Seems like a reasonable approach to me.
>> Best,
>> --
>> Noah Slater,

Chris Anderson
