couchdb-dev mailing list archives

From Michael Fair <mich...@daclubhouse.net>
Subject Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)
Date Wed, 23 Mar 2016 00:30:25 GMT
Greetings CouchDBers!

I've been modifying a BERT library to recreate the md5 calc of a RevisionID
in Java.

I haven't tackled attachments yet, however with the awesome help of rnewson
on the IRC channel, I've succeeded in recreating the md5 for all the
documents I've tried so far which includes docs with values of strings, big
and small integers, lists of big integers, lists of small integers, true,
false, null, and objects; however the glaring exception is floats.

The {minor_version, 0} format used for floats (a 31-byte string
representation in %.20e format) depends on the host environment doing the
encoding and can't be reliably reproduced on other machines or in other
languages.

For instance, here are examples of encoding 3.14159 as %.20e string on this
laptop:
erlang: 3.1415899999999999000e+00  (This is what term_to_binary is using)
python: 3.14158999999999988262e+00
java:   3.14159000000000000000e+00
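
As a rough illustration of why the strings differ (a sketch, not the BERT
library code mentioned in this message), the same double can be rendered
with %.20e in two ways from Java alone: by formatting the double directly,
or by formatting its exact binary value through BigDecimal. The class and
method names here are hypothetical.

```java
import java.math.BigDecimal;
import java.util.Locale;

public class FloatText {
    // Format the double directly with %.20e.
    static String viaDouble(double d) {
        return String.format(Locale.ROOT, "%.20e", d);
    }

    // Format the exact binary value held by the double with %.20e.
    static String viaExactValue(double d) {
        return String.format(Locale.ROOT, "%.20e", new BigDecimal(d));
    }

    public static void main(String[] args) {
        System.out.println(viaDouble(3.14159));
        System.out.println(viaExactValue(3.14159));
    }
}
```

The two results disagree in the trailing digits even though both describe
the same 64-bit value, which is the same kind of disagreement as in the
erlang/python/java listing above.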

These minor numerical differences unfortunately make the md5 computation
untenable.  Further, it seems that even different OTP versions and
different hardware will encode the {minor_version, 0} format slightly
differently on different Couch instances (a couple of people on IRC shared
with me what their OTP produced).
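
To make the effect on the hash concrete, here is a minimal sketch that
assumes the old float layout of the Erlang external term format: version
byte 131, tag 99 (FLOAT_EXT), then the %.20e text padded with zero bytes to
31 bytes. It hashes only that single float term, not the full term CouchDB
feeds to md5, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class OldFloatExt {
    // Wrap an already formatted %.20e string as a string-based float term.
    static byte[] encode(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        if (ascii.length > 31) {
            throw new IllegalArgumentException("float text longer than 31 bytes");
        }
        byte[] out = new byte[2 + 31];   // remaining bytes stay zero (padding)
        out[0] = (byte) 131;             // external term format version
        out[1] = (byte) 99;              // FLOAT_EXT tag
        System.arraycopy(ascii, 0, out, 2, ascii.length);
        return out;
    }

    // Lower-case hex MD5 of a byte array.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The two strings are the erlang and python renderings quoted above.
        System.out.println(md5Hex(encode("3.1415899999999999000e+00")));
        System.out.println(md5Hex(encode("3.14158999999999988262e+00")));
    }
}
```

Because the text itself is part of the hashed bytes, any difference in the
trailing digits yields a different md5.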


To make a long story short and spare folks reading the mind-numbing
details, without changing something, replicating the md5 for the revision
id of documents with floats just can't be done sanely.

As things are now, like I mentioned, even different installations of
CouchDB can disagree on the MD5 revision id for the document {"pi":3.14159}.


So where does this create an issue?

It shows up as a conflict document during replication when the two servers
calculated different revision ids for the same document update.  That only
happens with a multi-master update, where both sides were updated before
replicating (like separate laptops on separate planes each doing the same
thing).

If only one side or the other was updated, it doesn't cause a problem.

My goal is to enable people to upload documents from multiple server
applications using JSON, with Couch handling the replication bits.

To give this heterogeneous environment the same multi-master intelligence
that Couch has, they need to be able to compute the same revision id that
Couch would compute; otherwise documents modified directly in couch could
create these kinds of multi-master type conflicts.


----

What to do (aside from simply doing nothing)?

At the least I recommend changing the term_to_binary computation to use the
{minor_version, 1} option in the rev_id calculation.

This changes how floats are encoded to the 64-bit IEEE format.  It became
the standard way of encoding floats in OTP 17.0+ and is available as an
option all the way back to OTP 11.  As long as it's explicitly provided as
a requested option in the term_to_binary call, all currently deployed OTP
installations for Couch can do it.

Doing this normalizes the md5 calculation for floats regardless of the OTP
platform, and should make it feasible for third party applications to
replicate the encoding.
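
For comparison, a minimal sketch of the {minor_version, 1} float layout as
a third-party encoder might write it, assuming the external term format's
NEW_FLOAT_EXT term: version byte 131, tag 70, then the 8 bytes of the IEEE
double in big-endian order. The class and method names are hypothetical,
and this covers a lone float term only.

```java
import java.nio.ByteBuffer;

public class NewFloatExt {
    // Encode a double as a standalone binary float term.
    static byte[] encode(double d) {
        return ByteBuffer.allocate(10)   // ByteBuffer is big-endian by default
                .put((byte) 131)         // external term format version
                .put((byte) 70)          // NEW_FLOAT_EXT tag
                .putDouble(d)            // 8-byte IEEE 754 value
                .array();
    }

    // Read the double back out of a term produced by encode().
    static double decode(byte[] term) {
        return ByteBuffer.wrap(term, 2, 8).getDouble();
    }

    public static void main(String[] args) {
        byte[] term = encode(3.14159);
        System.out.println(term.length + " bytes, round trip: " + decode(term));
    }
}
```

No decimal formatting is involved here, so the bytes depend only on the
64-bit value of the float and not on how a given platform prints it.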



I have some other ideas beyond that, but they would require changes to the
replication protocol to support.


----

For anyone interested I'd be happy to share the code I have.  It's still a
bit rough in the document construction part, but once constructed, getting
the binary encoding and revision id are each just a single call.


Thanks,
Mike
