From: Damien Katz
To: dev@couchdb.apache.org
Subject: Re: Unicode normalization (was Re: The 1.0 Thread)
Date: Fri, 26 Jun 2009 07:08:32 -0400
In-Reply-To: <7db9abd30906252232o43fa92c8p2e455555024bec51@mail.gmail.com>
References: <418F2A99-DC59-4BF3-B371-A2E07FD2C567@apache.org> <7db9abd30906252232o43fa92c8p2e455555024bec51@mail.gmail.com>

MD5 here is for integrity purposes, not security, so manufactured
collisions aren't a problem we are worried about. And I don't think
there is a standard SHA-1 header, not that I could find anyway.

-Damien

On Jun 26, 2009, at 1:32 AM, kowsik wrote:

> Please use SHA-1 because creating collisions with MD5 is trivial:
>
> http://web.archive.org/web/20070604205756/http://www.infosec.sdu.edu.cn/paper/md5-attack.pdf
> http://www.mscs.dal.ca/~selinger/md5collision/
>
> etc.
>
> Google for "md5 collision". Effectively, what this means is that it's
> easy to generate two documents that have the same MD5 hash. I'm sure
> SHA-1 will be an issue at "some point in the future", but MD5 is
> already broken from a hashing perspective.
>
> K.
>
> On Thu, Jun 25, 2009 at 2:37 PM, Damien Katz wrote:
>> I am now working on an implementation of deterministic revs. After
>> a lot of thinking about this, I've decided to not reuse the revision
>> ids for integrity checking. The canonicalization problem is
>> unresolved and using a CouchDB-specific canonicalization means other
>> libs/langs/platforms can't play easily with CouchDB replication.
>>
>> Integrity will be preserved by use of Content-MD5 when
>> transferring/replicating documents, and by checking the document
>> hash when reading from disk. The replicator HTTP client will check
>> the integrity of the network bodies.
>>
>> If you need end-to-end integrity checking, you can use an
>> application-specific scheme to sign/hash various fields and
>> attachments, if you can deal with the string and floating point
>> canonicalization issues.
>>
>> My plan is that when generating new rev ids, CouchDB will
>> deterministically generate the same revision id when edited with the
>> same data. But it is still specific to the version of CouchDB and its
>> dependencies (version of Erlang, version of ICU, etc). It will
>> usually be the same across versions, but that is not guaranteed.
>>
>> What this will allow is for a single client to send the same edits
>> to 2 identical Erlang servers and see the same revids generated on
>> both. Optionally, it will allow that if 2 clients make byte-identical
>> saves for a document, they will get the same revision, and you don't
>> need to return a conflict error to the second client to save. I'm not
>> sure about implementing this though.
>>
>> To implement this, CouchDB will store an MD5 hash of all the
>> attachments along with the JSON document. When saving a new document
>> we hash the native document and the attachment hashes together to
>> generate the revision id.
>>
>> CouchDB will also store an MD5 hash of the JSON document itself.
>> This will give us disk integrity checking for all documents and
>> their attachments in a database. When CouchDB encounters a corrupt
>> document or attachment it will stop what it's doing and return an
>> error. The admin can restore from backup or recreate by deleting and
>> re-replicating from a peer.
>>
>> I think this is the most pragmatic way to do deterministic revs and
>> integrity checking. That is, do as little as possible and let others
>> deal with the problems and implications of canonicalization if they
>> want to do end-to-end integrity checking.
>>
>> Feedback please.
>>
>> -Damien
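
The scheme proposed in the quoted message (hash the document together with the MD5 hashes of its attachments to get a deterministic revision id, and use Content-MD5 to check bodies in transit) can be sketched roughly as below. This is an illustration only, not CouchDB's implementation: the function names are hypothetical, sorted-key JSON stands in for the "native document" encoding (which the message ties to the CouchDB/Erlang version), and feeding attachment digests in name order is an assumption.

```python
import base64
import hashlib
import json
from typing import Mapping


def content_md5(body: bytes) -> str:
    """Value format of the HTTP Content-MD5 header: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def body_is_intact(body: bytes, header_value: str) -> bool:
    """What a replicating HTTP client might check on a received body."""
    return content_md5(body) == header_value


def rev_hash(doc: Mapping, attachments: Mapping[str, bytes]) -> str:
    """Hypothetical deterministic revision hash.

    Assumption: sorted-key, compact JSON replaces the native term encoding,
    so the same data always yields the same bytes in this sketch.
    """
    h = hashlib.md5()
    h.update(json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8"))
    # Assumption: attachment hashes are mixed in ordered by attachment name.
    for name in sorted(attachments):
        h.update(hashlib.md5(attachments[name]).digest())
    return h.hexdigest()


if __name__ == "__main__":
    doc = {"title": "hello", "count": 1}
    files = {"a.txt": b"first", "b.txt": b"second"}
    print(rev_hash(doc, files))
    print(body_is_intact(b"payload", content_md5(b"payload")))
```

Two servers running this same code on byte-identical input produce the same hash, which is the property the message is after. As the message notes, any change in the encoding step (here, the JSON serializer) can change the result, so the value is not meant to be stable across implementations.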