couchdb-dev mailing list archives

From Paul Davis <>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Tue, 30 Jun 2009 16:13:04 GMT
On Tue, Jun 30, 2009 at 11:46 AM, Damien Katz<> wrote:
> On Jun 30, 2009, at 11:22 AM, Noah Slater wrote:
>> On Tue, Jun 30, 2009 at 07:12:07AM -0400, Damien Katz wrote:
>>> I'm not sure I understand why we can't just calculate and send the MD5
>>> header for the content range.
>> We could, but are you not proposing that we use this value for the document
>> revision? If that is the case, when you do range requests, the hash sent back
>> doesn't actually correspond to anything. If I used the hash from the final
>> range request of a document to post an update, it would presumably fail.
> To clarify, the point of deterministic rev ids is only to avoid unnecessary
> conflicts when the identical edits are made on 2 different replicas. If the
> content was identical when editing the same revision, it should not be a
> conflict. If we had a canonical representation of the document, we could
> also use the deterministic rev ids for integrity checking, but we don't have
> a canonical representation, and creating one is very difficult to get right.

We most definitely do *not* need a formal canonical JSON RFC to do
deterministic document revisions that can be used for integrity
checking.  The only thing a formal canonical specification allows is
for non-Erlang clients to compute the deterministic revisions.

A side benefit of a formal canonicalization is that if it included
Unicode and float normalization we would then be removing the
possibility of spurious conflicts that are the result of differences
in information *representation*. It's important to note that
non-normalized algorithms would still be quite capable of avoiding
many spurious conflicts; normalization just makes the chance of
spurious conflict (theoretically) zero.
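
To illustrate what "same information, different representation" means,
here is a minimal Python sketch. It is not CouchDB code; the variable
names are made up for illustration and it only uses the Python standard
library. It shows two encodings of the same character that differ byte
for byte until they are Unicode-normalized, and two JSON float
spellings that parse to the same number.

```python
import json
import unicodedata

# Two representations of the same visible character "é".
composed = "\u00e9"      # single code point U+00E9
decomposed = "e\u0301"   # 'e' followed by combining acute accent U+0301

# Without normalization the encoded bytes differ, so a hash over them differs.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# After NFC normalization both collapse to the same code point sequence.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# "1.0" and "1.00" are different JSON text but parse to the same float.
float_equal = json.loads("1.0") == json.loads("1.00")
```

A hash taken over the un-normalized bytes would treat the two strings
as different documents, which is exactly the kind of spurious conflict
normalization removes.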

In other words, we're more than free to just md5 the
term_to_binary(Json) and use that as a deterministic revision. Things
like sorting object members and normalization are just optimizations
to reduce conflicts when the information is constant but the
representation has changed.
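
A rough sketch of the same idea in Python, as an analogy only: the
function names are hypothetical, and json.dumps stands in for
term_to_binary, which it does not reproduce. Hashing the serialization
as-is already gives identical revisions for identical edits; sorting
object members additionally makes the hash insensitive to member order.

```python
import hashlib
import json

def naive_rev(doc: dict) -> str:
    """Hash the document as serialized, keeping member order as given."""
    data = json.dumps(doc).encode("utf-8")
    return hashlib.md5(data).hexdigest()

def sorted_rev(doc: dict) -> str:
    """Hash a serialization with sorted members and fixed separators."""
    data = json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.md5(data).hexdigest()

# The same edit made on two replicas, written with the same member order.
replica_a = {"name": "x", "count": 1}
replica_b = {"name": "x", "count": 1}

# The same information, but with the members in a different order.
reordered = {"count": 1, "name": "x"}

same_edit_matches = naive_rev(replica_a) == naive_rev(replica_b)
reorder_breaks_naive = naive_rev(replica_a) != naive_rev(reordered)
reorder_ok_when_sorted = sorted_rev(replica_a) == sorted_rev(reordered)
```

The naive hash is already deterministic; the sorted variant only widens
the set of representations that map to the same revision.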

Paul Davis

> What I'm proposing is that we only use content-MD5 for payload integrity
> checking. It will not be used for security and it cannot be validated
> against the rev id because they will always be different. The rev Id will be
> generated based on the erlang term format of the document, not the UTF8 JSON
> string that gets sent to the client.
> So the server will send its responses (perhaps optionally) with an MD5 hash
> to detect packet corruption. Clients, when they send docs and attachments,
> can send the payload with a content-MD5 header and the server will check it
> to make sure it's uncorrupted. As it writes the data to disk the server will
> compute the MD5 hash for its own integrity checking later.
> So for example, the replicator will check the md5 sig from the server and
> send its own md5 sig when writing data. This prevents network problems from
> introducing corruptions to data as it replicates.
> -Damien
>> Best,
>> --
>> Noah Slater,
