couchdb-dev mailing list archives

From Paul Davis <>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Fri, 26 Jun 2009 17:08:42 GMT
On Thu, Jun 25, 2009 at 5:37 PM, Damien Katz<> wrote:
> I am now working on an implementation of deterministic revs. After a lot of
> thinking about this, I've decided to not reuse the revision ids for
> integrity checking. The canonicalization problem is unresolved and using a
> CouchDB specific canonicalization means other libs/langs/platforms can't
> play easily with CouchDB replication.
> Integrity will be preserved by use of Content-MD5 when
> transferring/replicating documents, and checking the document hashing when
> reading from disk. The replicator HTTP client will check the integrity of
> the network bodies.
> If you need end-to-end integrity checking, you can use an application
> specific scheme to sign/hash various fields and attachments, if you can deal
> with the string and floating point canonicalization issues.
> My plan is that when generating new rev ids, CouchDB will deterministically
> generate the same revision id when edited with the same data. But it is still
> specific to the version of CouchDB and its dependencies (version of Erlang,
> version of ICU, etc). It will usually be the same across versions, but that
> is not guaranteed.
> What this will allow is for a single client to send the same edits to 2
> identical Erlang servers and see the same revids generated on both.
> Optionally, it will allow that if 2 clients make byte-identical saves for a
> document, they will get the same revision, and you don't need to return a
> conflict error to the second client to save. I'm not sure about implementing
> this though.
> To implement this, CouchDB will store an MD5 hash of all the attachments
> along with the JSON document. When saving a new document, we hash the native
> document and the attachment hashes together to generate the revision id.
> CouchDB will also store an MD5 hash of the JSON document itself. This will
> give us disk integrity checking for all documents and their attachments in a
> database. When CouchDB encounters a corrupt document or attachment, it will
> stop what it's doing and return an error. The admin can restore from backup
> or recreate by deleting and re-replicating from a peer.
> I think this is the most pragmatic way to do deterministic revs and
> integrity checking. That is, do as little as possible and let others deal
> with the problems and implications of canonicalization if they want to do
> end-to-end integrity checking.
> Feedback please.

One thing that strikes me as potentially bad is that the signature
can't be recalculated. Not sure if that's important or not.
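
One way to read that concern, as a minimal Python sketch rather than
anything CouchDB actually does: if the hash is taken over one particular
byte encoding of the document, a third party that re-serializes the same
JSON may get different bytes and therefore a different hash. The helper
name md5_hex and the sample document are made up for illustration.

```python
import hashlib
import json


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a byte string (hypothetical helper)."""
    return hashlib.md5(data).hexdigest()


# One byte-level encoding of a small JSON document: escaped non-ASCII
# character and spaces after the separators.
original = b'{"name": "caf\\u00e9", "price": 1.0}'

# The same logical document, re-serialized by another library or with
# different settings: raw UTF-8 and no whitespace.
reserialized = json.dumps(
    json.loads(original),
    ensure_ascii=False,
    separators=(",", ":"),
).encode("utf-8")

# The parsed documents are equal, but the bytes (and so the hashes) differ.
same_document = json.loads(original) == json.loads(reserialized)
same_hash = md5_hex(original) == md5_hex(reserialized)
```

Here same_document is true while same_hash is false, which is the
canonicalization gap the quoted proposal leaves to the application.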

> -Damien
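
For reference, the scheme described in the quoted proposal (hash the
document together with the per-attachment hashes to get the revision id)
might be sketched in Python as follows. This is an illustration under
assumptions, not the CouchDB implementation: doc_bytes stands in for the
"native document" (in CouchDB that would be an Erlang term), the function
names are invented, and feeding the attachments in sorted-name order is my
own choice to keep the result deterministic.

```python
import hashlib
from typing import Mapping


def attachment_digest(data: bytes) -> bytes:
    """MD5 digest stored alongside the document for one attachment."""
    return hashlib.md5(data).digest()


def revision_id(doc_bytes: bytes, attachments: Mapping[str, bytes]) -> str:
    """Derive a revision id from the document bytes and attachment hashes.

    Assumption: attachments are processed in sorted-name order so the
    same inputs always produce the same id.
    """
    h = hashlib.md5()
    h.update(doc_bytes)
    for name in sorted(attachments):
        h.update(name.encode("utf-8"))
        h.update(attachment_digest(attachments[name]))
    return h.hexdigest()


# Two byte-identical saves produce the same revision id ...
rev_a = revision_id(b'{"title":"hello"}', {"a.txt": b"one"})
rev_b = revision_id(b'{"title":"hello"}', {"a.txt": b"one"})

# ... while a changed attachment produces a different one.
rev_c = revision_id(b'{"title":"hello"}', {"a.txt": b"two"})
```

With this sketch rev_a equals rev_b and differs from rev_c. As the proposal
notes, the result still depends on how the document bytes were produced, so
it is only stable across servers that serialize identically.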
