couchdb-dev mailing list archives

From Randall Leeds <randall.le...@gmail.com>
Subject Re: Why MD5 is used for hashes, also about non-deterministic IDs.
Date Mon, 14 Nov 2011 22:48:11 GMT
On Mon, Nov 14, 2011 at 10:52, Alex Besogonov <alex.besogonov@gmail.com> wrote:
> I'm looking at CouchDB source code and I have several questions:
>
> 1) Why MD5 is used instead of more secure hashes. It's very real to
> imagine a situation where a malicious user can cause hash collision
> and cause problems in replication.

Can you explain a little bit more where you see this interacting with
replication?

>
> 2) ID is not completely deterministic - it depends on
> compression_level and compressible_types settings for attachments.
> Would it make sense to use MD5 of the original uncompressed document?
> And while you're at it, it probably makes sense to include file size
> in Atts2 tuple.
>

As far as I know, nothing requires that IDs be deterministic. It's
useful for reducing conflicts when identical changes are replayed on
different replicating couches, but it's not strictly required.
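
To illustrate why deterministic IDs help here, a minimal sketch, assuming
the IDs in question are revision IDs derived by hashing the parent revision
together with the document body. The functions deterministic_rev and
random_rev, and the exact hashing scheme, are hypothetical and are not
CouchDB's actual code:

```python
import hashlib
import json
import uuid


def deterministic_rev(parent_rev: str, body: dict) -> str:
    """Hypothetical scheme: MD5 over the parent revision plus canonical JSON."""
    # Sorting keys makes the serialization independent of key order.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.md5((parent_rev + canonical).encode("utf-8")).hexdigest()


def random_rev(parent_rev: str, body: dict) -> str:
    """Hypothetical alternative: an ID unrelated to the content."""
    return uuid.uuid4().hex


parent = "1-abc"  # made-up parent revision
edit_on_couch_a = {"title": "hello", "count": 2}
edit_on_couch_b = {"count": 2, "title": "hello"}  # same change, replayed elsewhere

# Identical changes hash to the same ID, so replication sees one revision.
same_id = deterministic_rev(parent, edit_on_couch_a) == deterministic_rev(
    parent, edit_on_couch_b
)

# Content-independent IDs differ, so the same change would look like a conflict.
different_ids = random_rev(parent, edit_on_couch_a) != random_rev(
    parent, edit_on_couch_b
)
```

Either scheme still works for replication; the random one just produces
conflicts that have to be resolved even though the content is identical.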

With respect to uncompressed file size, sometimes that information is
not available for attachments since they may have been sent over the
wire in compressed form. We had this conversation a few times when
adding compression features and it was decided that uncompressing
on the fly, server-side, just to get the uncompressed file size and
hash was not worth it.

Attachment records do have att_len and disk_len properties (sometimes
the same, depending on the encoding/compression during upload) and I
believe these are exposed in the _attachments metadata on document
requests. I don't know exactly what has changed in which release, so
they may not be visible in a released version of CouchDB. Looking at
the code in master right now, I see "length", "encoded_length", and
"digest" included in the attachment metadata.
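
A rough sketch of why a digest taken over the stored (compressed) bytes
depends on the compression level while the original length does not. This
is not CouchDB's code: attachment_metadata is a hypothetical helper, zlib
stands in for whatever encoding the server applies, and the "md5-" plus
base64 digest layout is an assumption. Only the three field names come
from the metadata mentioned above.

```python
import base64
import hashlib
import zlib


def attachment_metadata(original: bytes, level: int) -> dict:
    """Hypothetical helper: metadata for an attachment stored compressed."""
    encoded = zlib.compress(original, level)
    # The digest covers the encoded bytes, not the original ones.
    md5_b64 = base64.b64encode(hashlib.md5(encoded).digest()).decode("ascii")
    return {
        "length": len(original),         # uncompressed size
        "encoded_length": len(encoded),  # size as stored
        "digest": "md5-" + md5_b64,      # assumed layout
    }


payload = b"CouchDB attachment body. " * 200

fast = attachment_metadata(payload, 1)  # lowest compression level
best = attachment_metadata(payload, 9)  # highest compression level

# Same original content, so the uncompressed length matches...
lengths_match = fast["length"] == best["length"]
# ...but the encoded bytes differ, and so does the digest.
digests_differ = fast["digest"] != best["digest"]
```

Hashing the uncompressed bytes instead would give one digest regardless of
level, at the cost of decompressing anything that arrived already encoded.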

Thanks!
-Randall
