couchdb-dev mailing list archives

From Randall Leeds <>
Subject Re: Why MD5 is used for hashes, also about non-deterministic IDs.
Date Thu, 17 Nov 2011 22:57:52 GMT
On Wed, Nov 16, 2011 at 09:46, Alex Besogonov <> wrote:
> On Tue, Nov 15, 2011 at 4:23 PM, Randall Leeds <> wrote:
>>> Remember that the _rev value is derived from the contents of the
>>> documents, all the bytes of all attachments and values from previous
>>> revisions. Stock MD5 preimage attacks are of a much simpler form
>>> (finding a Y such that MD5(Y)=X for some desired X). Also that you
>>> would have to arrange for the same number of updates as well, since
>>> the number at the front is incremented on each successful update.
>> Also remember that the contents would have to parse as JSON, so that
>> restricts this search space even further.
> Not really. Binary representation of JSON is used to calculate the hash.
> So I can make a document like this:
> ===
> {
>  "aa" : "xxxxxxxxxxxxxx.....[several thousands x's]"
> }
> ===
> And use the large 'xxx...x' string as a scratch area for my attack. I don't
> even need to bother with quoting issues because CouchDB is going to
> unquote everything during JSON parsing. And there are no other hash
> codes to work around (working around even two MD5s at the same time
> is much harder).
> That's about the best possible case for an attacker.

This "attack", though, is still pretty hard, and, I think, not an
attack. The document _does_ have to take a trip through a JSON parser
and pass as valid JSON, yet still produce an MD5 sum, along with the
metadata, that matches the revision id of the original document. All
this needs to be done on a Couch that is trusted to perform unfiltered,
bi-directional replication and allows the attacker to change documents
that matter to other people.
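
To make the target concrete, here is a toy model of a revision id of
the form N-hash. It is only a sketch: next_rev is a made-up helper, and
hashing the previous revision id plus canonical JSON is an assumption
for illustration, not the encoding CouchDB actually hashes.

```python
import hashlib
import json


def next_rev(prev_rev, body):
    """Return a toy revision id of the form "<N>-<md5 hex>".

    Simplified model only: real CouchDB hashes its own internal encoding
    of the document, attachments and previous revision, not this JSON.
    """
    # The counter at the front goes up by one on every update.
    prev_n = int(prev_rev.split("-", 1)[0]) if prev_rev else 0
    # Canonical JSON so the same content always hashes the same way.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    data = ((prev_rev or "") + canonical).encode("utf-8")
    return "%d-%s" % (prev_n + 1, hashlib.md5(data).hexdigest())


rev1 = next_rev(None, {"aa": "original"})
rev2 = next_rev(rev1, {"aa": "changed"})
# An attacker's document with a large scratch area is still valid JSON,
# but it has to land on the same digest *and* the same counter.
forged = next_rev(None, {"aa": "x" * 4000})
print(rev1, rev2, forged == rev1)
```

In this model a forgery has to match both the counter and the digest,
which is the point made above about arranging the same number of
updates.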

The proper way to stop the "attack" is to not let users modify
documents that will screw up things for other people. It's kind of
like how a UNIX user is _welcome_ to trash their .bashrc: just because
their home directory is mounted over NFS, so that their .bashrc is now
trashed _everywhere_, doesn't mean they've really done any damage
from anyone else's point of view. They didn't attack anything but
themselves.


However, it's worth noting that an attacker can just make up whatever
revision identifiers they want to, without dealing with the MD5 stuff
at all! Passing ?new_edits=false allows an "attacker" to specify
that a document has any revision they want, with whatever history of
revisions they want.

curl -XPUT -H"Content-Type: application/json"
-d'{"_id":"document", "_rev":"5-anything",
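
The command above is cut off in the archive, so rather than guess at
the rest of it, here is a separate sketch of what such a request could
look like. It only builds the URL and JSON body and sends nothing. The
server address, database name, revision history and extra field are all
made up; build_forced_put is a hypothetical helper. The _revisions
layout ("start" plus "ids", newest first) is the one CouchDB uses when
it reports a document's history.

```python
import json
from urllib.parse import quote


def build_forced_put(base_url, db, doc_id, start, rev_ids, fields):
    """Return (url, body) for a PUT that stores a revision as given.

    rev_ids lists revision hashes newest first; start is the number of
    the newest one. Nothing here checks that the hashes mean anything.
    """
    doc = dict(fields)
    doc["_id"] = doc_id
    doc["_rev"] = "%d-%s" % (start, rev_ids[0])
    doc["_revisions"] = {"start": start, "ids": list(rev_ids)}
    url = "%s/%s/%s?new_edits=false" % (
        base_url.rstrip("/"), quote(db, safe=""), quote(doc_id, safe=""))
    return url, json.dumps(doc, sort_keys=True)


# Hypothetical server, database and history; the request is not sent.
url, body = build_forced_put(
    "http://localhost:5984", "db", "document", 5,
    ["anything", "made", "up", "history", "here"], {"value": 1})
print(url)
print(body)
```

The revision "hashes" are arbitrary strings here, which is the whole
point: with new_edits=false nothing has to be derived from MD5.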

(Side note to devs: we may want to deterministically prune the leaves
for duplicates after merging rev trees, or not, because, well, this is
a crazy hand-crafted fake-out and caveat power-user.)

In fact, I just discovered yesterday that you can create unreachable
conflicts this way, by giving them revision ids and histories that
create two branches with identical leaves but different stems. If
CouchDB did decide to enforce some crypto-verifiable constraints on
revision ids, they could be checked to prevent this kind of
mis-history. However, other implementations would be forced to follow
the same scheme. I think the intention of making the revision ID
opaque was to make it an implementation detail and specifically _not_
a security or validation feature.
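
A small sketch of the "identical leaves, different stems" situation,
using plain lists of revision ids (newest first). The ids and the
find_ambiguous_leaves helper are invented for illustration; this is not
how CouchDB stores or merges its revision trees.

```python
def find_ambiguous_leaves(paths):
    """Return leaf ids that terminate more than one distinct history.

    Each path is a list of revision ids, leaf first, root last.
    """
    stems_by_leaf = {}
    for path in paths:
        leaf, stem = path[0], tuple(path[1:])
        stems_by_leaf.setdefault(leaf, set()).add(stem)
    return sorted(leaf for leaf, stems in stems_by_leaf.items()
                  if len(stems) > 1)


# Two hand-crafted branches that end in the same leaf id.
paths = [
    ["3-leaf", "2-left", "1-root"],
    ["3-leaf", "2-right", "1-root"],
    ["2-other", "1-root"],
]
print(find_ambiguous_leaves(paths))
```

A check along these lines is one way the leaves could be examined for
duplicates after a merge.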

That said, I'm starting to come around to this idea. I'd be happy to
see patches that enable a "strict revisions mode" for CouchDB. I don't
feel like CouchDB has made any promises that are broken by using MD5,
but additional promises could possibly be made if we took a git-like
approach to revision crypto.
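
As a rough idea of what a git-like "strict revisions mode" might check,
here is a sketch in which every revision id commits to its parent id
and its body. Everything in it is an assumption: CouchDB has no such
mode today, strict_rev and verify_history are hypothetical names, and
SHA-256 over canonical JSON is just one possible choice.

```python
import hashlib
import json


def strict_rev(parent_rev, body):
    """Derive "<N>-<sha256 hex>" from the parent revision id and body."""
    n = int(parent_rev.split("-", 1)[0]) + 1 if parent_rev else 1
    payload = json.dumps([parent_rev, body], sort_keys=True,
                         separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return "%d-%s" % (n, digest)


def verify_history(history):
    """Check a list of (rev_id, body) pairs, oldest first."""
    parent = None
    for rev_id, body in history:
        if rev_id != strict_rev(parent, body):
            return False  # id does not follow from parent and content
        parent = rev_id
    return True


r1 = strict_rev(None, {"v": 1})
r2 = strict_rev(r1, {"v": 2})
good = [(r1, {"v": 1}), (r2, {"v": 2})]
forged = [(r1, {"v": 1}), ("2-anything", {"v": 2})]
print(verify_history(good), verify_history(forged))
```

Under such a scheme a made-up id like "2-anything" fails verification,
but every other implementation would have to derive ids the same way.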

I hope that settles the "why", reassures any
"oh-my-god-my-couch-is-vulnerable", and motivates the
"hey-lets-make-a-patch" if you still want the feature, with the
understanding that it's unlikely the project will specify this as a
necessary condition for general-purpose replication. If you have more
bullet-proof needs, dev that armor up and I'll review it, but I'd
advise making it a config option.


>> Then, if I understand Jason
>> correctly, we're also talking about a situation where Couch B is
>> insecure... it's allowing a malicious user to change documents. If
>> these documents are anything more important than something affecting
>> the user herself then what you have is a malicious administrator or an
>> insecure deployment. I don't think MD5 is to blame here.
> No, the issue here is a possibility to break the synchronization.
>> Does that sound like a reasonable assessment to you, Alex?
> Almost.
>> Also, I'd love to hear about your C++ replicator as it develops.
> Sure, I'm developing a very small and fast embedded storage for mobile
> devices and desktop apps. It'll be open source once I finish its core.
>> -Randall
>>> For switching from MD5 to SHA-1, I say no. If we switch, let's use
>>> something contemporary like SHA-256. Better yet, let's wait for the
>>> winner of the SHA-3 competition.
>>> B.
>>> On 15 November 2011 07:57, Jason Smith <> wrote:
>>>> On Tue, Nov 15, 2011 at 7:34 AM, Alex Besogonov
>>>> <> wrote:
>>>>>>> Now I make a change to 'Doc' at machine A. This creates a new
>>>>>>> revision with new md5 hash.
>>>>>>> A malicious software somehow learns about this update and creates
>>>>>>> another document
>>>>>>> on machine B, contriving it so to make the resulting hash to be the
>>>>>>> same as on machine A.
>>>>>> Before going any further, you must show why we care about the contents
>>>>>> of machine B.
>>>>>> Why would I log in to machine B if I do not trust B's owner? Why would
>>>>>> I clone your Git repository if I do not know you?
>>>>> The problem is, MD5 hash depends on _untrusted_ data that external
>>>>> processes might put into the database.
>>>>> For example, imagine that machines A and B use CouchDB to store
>>>>> certificates.
>>>> I ask again.
>>>> --
>>>> Iris Couch
