couchdb-dev mailing list archives

From Damien Katz <>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Fri, 26 Jun 2009 11:08:32 GMT
MD5 here is for integrity purposes, not security, so manufactured
collisions aren't a problem we are worried about. And I don't think
there is a standard SHA-1 header, not that I could find anyway.
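
For illustration, a minimal Python sketch of the kind of Content-MD5 check
discussed here. The header value is the base64 encoding of the raw 16-byte
MD5 digest of the body. The function names content_md5 and verify_body are
hypothetical and are not taken from CouchDB.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return a Content-MD5 header value: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def verify_body(body: bytes, header_value: str) -> bool:
    """Return True if the received body matches the Content-MD5 header."""
    return content_md5(body) == header_value.strip()


if __name__ == "__main__":
    body = b'{"_id": "doc1", "value": 42}'
    header = content_md5(body)
    print(verify_body(body, header))         # body arrived intact
    print(verify_body(body + b" ", header))  # body was altered in transit
```

This only detects accidental corruption. As noted above, it is not meant to
resist deliberately manufactured collisions.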


On Jun 26, 2009, at 1:32 AM, kowsik wrote:

> Please use SHA-1 because creating collisions with MD5 is trivial:
> etc.
> Google for "md5 collision". Effectively, what this means is that it's
> easy to generate two documents that have the same MD5 hash. I'm sure
> SHA-1 will be an issue at "some point in the future", but MD5 is
> already broken from a hashing perspective.
> K.
> On Thu, Jun 25, 2009 at 2:37 PM, Damien Katz<> wrote:
>> I am now working on an implementation of deterministic revs. After a
>> lot of thinking about this, I've decided to not reuse the revision ids
>> for integrity checking. The canonicalization problem is unresolved and
>> using a CouchDB specific canonicalization means other
>> libs/langs/platforms can't play easily with CouchDB replication.
>> Integrity will be preserved by use of Content-MD5 when
>> transferring/replicating documents, and by checking the document hash
>> when reading from disk. The replicator http client will check the
>> integrity of the network bodies.
>> If you need end-to-end integrity checking, you can use an application
>> specific scheme to sign/hash various fields and attachments, if you can
>> deal with the string and floating point canonicalization issues.
>> My plan is that when generating new rev ids, CouchDB will
>> deterministically generate the same revision id when edited with the
>> same data. But it still is specific to the version of CouchDB and its
>> dependencies (version of Erlang, version of ICU, etc). It will usually
>> be the same across versions, but is not guaranteed.
>> What this will allow is for a single client to send the same edits to 2
>> identical Erlang servers and see the same revids generated on both.
>> Optionally, it will allow that if 2 clients make byte identical saves
>> for a document, they will get the same revision, and you don't need to
>> return a conflict error to the second client to save. I'm not sure
>> about implementing this though.
>> To implement this, CouchDB will store an md5 hash of all the
>> attachments along with the json document. When saving a new document,
>> we hash the native document and the attachment hashes together to
>> generate the revision id. CouchDB will also store an md5 hash of the
>> json document itself. This will give us disk integrity checking for all
>> documents and their attachments in a database. When CouchDB encounters
>> a corrupt document or attachment it will stop what it's doing and
>> return an error. The admin can restore from backup or recreate by
>> deleting and re-replicating from a peer.
>> I think this is the most pragmatic way to do deterministic revs and
>> integrity checking. That is, do as little as possible and let others
>> deal with the problems and implications of canonicalization if they
>> want to do end to end integrity checking.
>> Feedback please.
>> -Damien
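
The scheme in the quoted message (hash the document together with the
attachment hashes to get a revision id) can be sketched roughly as follows in
Python. This is an illustration under stated assumptions, not CouchDB's
implementation. The message says the native (Erlang) document is hashed;
here json.dumps with sorted keys stands in for that, and one MD5 digest per
attachment is assumed. The names attachment_digests and revision_hash are
hypothetical.

```python
import hashlib
import json


def attachment_digests(attachments: dict[str, bytes]) -> list[tuple[str, str]]:
    """Return (name, md5 hex digest) pairs, sorted by attachment name."""
    return sorted(
        (name, hashlib.md5(data).hexdigest())
        for name, data in attachments.items()
    )


def revision_hash(doc: dict, attachments: dict[str, bytes]) -> str:
    """Hash the document and the attachment hashes together.

    Assumption: a sorted-key JSON dump stands in for the "native document".
    The result is only stable for a given serializer and runtime version.
    """
    h = hashlib.md5()
    serialized = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    h.update(serialized.encode("utf-8"))
    for name, digest in attachment_digests(attachments):
        h.update(name.encode("utf-8"))
        h.update(digest.encode("ascii"))
    return h.hexdigest()


if __name__ == "__main__":
    doc = {"title": "hello", "count": 1}
    files = {"a.txt": b"first", "b.txt": b"second"}
    # The same data produces the same value on any identical setup.
    print(revision_hash(doc, files) == revision_hash(dict(doc), dict(files)))
```

Two saves with identical data yield the same value, which is what would let
two identical servers agree on a revision id. A change in the serializer or
its dependencies can change the output, which matches the caveat in the
message that ids are not guaranteed to be the same across versions.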
