couchdb-dev mailing list archives

From Alex Besogonov <>
Subject Re: Why MD5 is used for hashes, also about non-deterministic IDs.
Date Tue, 15 Nov 2011 04:41:23 GMT
On Mon, Nov 14, 2011 at 5:48 PM, Randall Leeds <> wrote:
>> I'm looking at the CouchDB source code and I have several questions:
>> 1) Why is MD5 used instead of a more secure hash? It is entirely
>> plausible that a malicious user could engineer a hash collision and
>> cause problems in replication.
> Can you explain a little bit more where you see this interacting with
> replication?
For example, imagine two replicas, on machine A and machine B, that hold
document 'Doc' in the same initial state.

Now I make a change to 'Doc' on machine A. This creates a new revid
containing a new MD5 hash. Malicious software somehow learns about this
update and creates another revision on machine B, contriving its content
so that the resulting hash is the same as on machine A.

During replication, machine B won't detect that a new version of the
document is present, so the change from machine A is never replicated.
An attack on MD5 that achieves this is quite possible today.

Now, it might not sound too threatening, but this attack breaks the main
invariant of CouchDB: the replicas will never become eventually consistent!

Also, I'd like to use a stronger hash just on general principles.
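To make the failure mode concrete, here is a toy model of the scenario. The `revid` and `needs_replication` functions are simplifications invented for illustration (real CouchDB revids hash more than the raw body, and the replicator protocol is more involved); the "collision" is simulated by reusing the same revid for a different body, standing in for a real chosen-prefix MD5 collision.

```python
import hashlib

def revid(generation, body):
    """Toy model of a CouchDB-style revision id: '<gen>-<md5 of body>'.
    (Real CouchDB hashes more than the raw body; this is a simplification.)"""
    return "%d-%s" % (generation, hashlib.md5(body).hexdigest())

def needs_replication(source_revs, target_revs):
    """Replicator-style diff: which source revisions is the target missing?"""
    return [r for r in source_revs if r not in target_revs]

# Honest update on machine A:
rev_a = revid(2, b'{"balance": 100}')

# If an attacker on machine B could craft a *different* body whose MD5
# matches, the target would already appear to "have" the revision and
# the diff would come back empty:
rev_b_forged = rev_a   # simulated collision: same revid, different body
missing = needs_replication([rev_a], [rev_b_forged])
print(missing)         # [] -- machine A's change is never replicated
```

The diff-based skip is exactly the point: nothing downstream ever inspects the bodies again once the revids match.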

>> 2) The ID is not completely deterministic - it depends on the
>> compression_level and compressible_types settings for attachments.
>> Would it make sense to use the MD5 of the original, uncompressed
>> document? And while you're at it, it probably makes sense to include
>> the file size in the Atts2 tuple.
> Nothing in my mind requires that IDs be deterministic. It's useful for
> reducing conflicts when identical changes are replayed on different
> replicating couches, but it's not strictly required.
Yes, strictly deterministic IDs are not required, but it would be nice to have
a canonical form.

> With respect to uncompressed file size, sometimes that information is
> not available for attachments, since they may have been sent over the
> wire in compressed form. We went over this conversation a few times
> when adding compression features, and it was decided that uncompressing
> on the fly, server-side, just to get the uncompressed file size and
> hash was not worth it.
Does it really have that much overhead? Usually only fairly small
text/HTML/CSS files are compressed. But okay, maybe at least a tag with
the compression scheme and level could be attached?
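If I recall the API correctly, CouchDB's `_attachments` metadata already carries `length`, `encoded_length`, and `encoding` when requested with `att_encoding_info=true`; the tag proposed here would only add the level. The `encoding_level` field below is invented for illustration, and the concrete numbers are made up:

```python
# Hypothetical _attachments entry extended with an explicit compression
# tag, so a canonical hash could be computed (or skipped) without
# decompressing server-side. "encoding_level" is not a real CouchDB field.
attachment_meta = {
    "content_type": "text/css",
    "length": 2048,            # uncompressed size, when known
    "encoded_length": 812,     # size as stored/transferred
    "encoding": "gzip",        # compression scheme
    "encoding_level": 6,       # proposed: level used at upload time
}
print(attachment_meta["encoding"], attachment_meta["encoding_level"])
```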

> Attachment records do have att_len and disk_len (sometimes the same,
> depending on the encoding/compression during upload) properties and I
> believe this is exposed in the _attachments metadata on document
> requests.
I'm thinking about making it part of the document's revid.
