couchdb-dev mailing list archives

From Damien Katz <dam...@apache.org>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Thu, 25 Jun 2009 21:37:21 GMT
I am now working on an implementation of deterministic revs. After a  
lot of thinking about this, I've decided to not reuse the revision ids  
for integrity checking. The canonicalization problem is unresolved and  
using a CouchDB specific canonicalization means other libs/langs/ 
platforms can't play easily with CouchDB replication.

Integrity will be preserved by using Content-MD5 when transferring or  
replicating documents, and by checking the document hash when reading  
from disk. The replicator's HTTP client will check the integrity of  
the network bodies.
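
A minimal sketch of the transfer check, assuming the Content-MD5 header carries the base64-encoded MD5 digest of the body (as HTTP defines it). The function names `content_md5` and `verify_body` are hypothetical and are not CouchDB APIs:

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the base64-encoded MD5 digest of a body (Content-MD5 style)."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def verify_body(body: bytes, header_value: str) -> bool:
    """Check a received body against the Content-MD5 value sent with it."""
    return content_md5(body) == header_value.strip()


# The sender computes the header, the receiver recomputes and compares.
body = b'{"_id": "doc1", "value": 42}'
header = content_md5(body)
intact = verify_body(body, header)
tampered = verify_body(body + b" ", header)
```

Here `intact` is true and `tampered` is false, so a replicator client could reject the second body before writing it.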

If you need end-to-end integrity checking, you can use an application  
specific scheme to sign/hash various fields and attachments, if you  
can deal with the string and floating point canonicalization issues.

My plan is that when generating new rev ids, CouchDB will  
deterministically generate the same revision id when edited with the  
same data. But it is still specific to the version of CouchDB and its  
dependencies (version of Erlang, version of ICU, etc.). It will usually be  
the same across versions, but that is not guaranteed.

What this will allow is for a single client to send the same edits to  
2 identical Erlang servers and see the same revids generated on both.  
Optionally, it will also allow that if 2 clients make byte-identical saves to  
a document, they will get the same revision, and you don't need to  
return a conflict error to the second client to save. I'm not sure about  
implementing this though.
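
A rough sketch of that optional behaviour, under loud assumptions: `TinyStore`, `Conflict`, and the idea of hashing the base revision together with the body are all hypothetical stand-ins, not how CouchDB stores or names revisions.

```python
import hashlib


class Conflict(Exception):
    """Raised when a save is based on a stale revision."""


class TinyStore:
    """Toy in-memory store: doc_id -> (current_rev, body)."""

    def __init__(self):
        self._docs = {}

    def save(self, doc_id, base_rev, body: bytes) -> str:
        # ASSUMPTION: the rev is a hash of the base rev plus the raw body bytes.
        new_rev = hashlib.md5((base_rev or "").encode("ascii") + body).hexdigest()
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if current_rev == base_rev:
            # Normal case: the client edited the latest revision.
            self._docs[doc_id] = (new_rev, body)
            return new_rev
        if current_rev == new_rev:
            # Another client already made the byte-identical save:
            # report the same revision instead of a conflict.
            return new_rev
        raise Conflict(doc_id)


store = TinyStore()
rev_a = store.save("doc1", None, b'{"n": 1}')
rev_b = store.save("doc1", None, b'{"n": 1}')  # second client, same bytes
```

With this sketch `rev_a` and `rev_b` are equal, while a second save with different bytes on the same base still raises `Conflict`.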

To implement this, CouchDB will store an MD5 hash of all the  
attachments along with the JSON document. When saving a new document,  
we hash the native document and the attachment hashes together to  
generate the revision id.
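
The scheme above could be sketched roughly as follows. This is only an illustration: `new_rev_id` and `attachment_digests` are hypothetical names, and `json.dumps` with sorted keys stands in for the native (Erlang) document representation, which is exactly the part that stays specific to the CouchDB version.

```python
import hashlib
import json


def attachment_digests(attachments):
    """Return (name, md5 hex digest) pairs, sorted by attachment name."""
    return sorted(
        (name, hashlib.md5(data).hexdigest()) for name, data in attachments.items()
    )


def new_rev_id(doc, attachments):
    """Hash the document together with its attachment hashes."""
    # ASSUMPTION: sorted-key JSON is a stand-in for the native document term.
    encoded = json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")
    h = hashlib.md5(encoded)
    for name, digest in attachment_digests(attachments):
        h.update(name.encode("utf-8"))
        h.update(digest.encode("ascii"))
    return h.hexdigest()


doc = {"title": "hello", "count": 1}
files = {"a.txt": b"first", "b.txt": b"second"}
rev1 = new_rev_id(doc, files)
rev2 = new_rev_id(dict(doc), dict(files))  # same data, separate objects
```

Two servers running the same code on the same data would compute the same value, which is the property the plan relies on; nothing here attempts cross-language canonicalization.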

CouchDB will also store an MD5 hash of the JSON document itself. This  
will give us disk integrity checking for all documents and their  
attachments in a database. When CouchDB encounters a corrupt document  
or attachment, it will stop what it's doing and return an error. The  
admin can restore from backup or recreate the database by deleting it and  
re-replicating from a peer.
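
A small sketch of the read-time check, assuming each stored record is simply a pair of digest and body bytes. `store_record`, `load_record`, and `CorruptDocumentError` are hypothetical names for illustration, not CouchDB internals:

```python
import hashlib


class CorruptDocumentError(Exception):
    """Raised when stored bytes no longer match their stored hash."""


def store_record(body: bytes):
    """Return the (digest, body) pair that would be written to disk."""
    return hashlib.md5(body).digest(), body


def load_record(record):
    """Verify the stored hash on read; stop with an error on mismatch."""
    digest, body = record
    if hashlib.md5(body).digest() != digest:
        raise CorruptDocumentError("stored hash does not match document body")
    return body


record = store_record(b'{"_id": "doc1"}')
body = load_record(record)  # returns the original bytes

# Simulate on-disk corruption by changing the body but not the digest.
damaged = (record[0], b'{"_id": "docX"}')
```

Calling `load_record(damaged)` raises `CorruptDocumentError`, mirroring the "stop and return an error" behaviour described above.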

I think this is the most pragmatic way to do deterministic revs and  
integrity checking. That is, do as little as possible and let others  
deal with the problems and implications of canonicalization if they  
want to do end-to-end integrity checking.

Feedback please.

-Damien
