incubator-wave-dev mailing list archives

From Michael MacFadden <>
Subject Re: Delving into OT
Date Tue, 25 Jun 2013 05:19:49 GMT
See comments inline.

On 6/24/13 1:06 PM, "Joseph Gentle" <> wrote:

>I want to start the discussion around what OT algorithms to use. This
>is going to get technical. This is not a bug. Please ask simple but
>tangential questions (like how does wave's current crypto system
>work?) in a new thread instead of this one.
In general, I'll put up a requirements page where we can collect these.

>- Arbitrary peers can synchronize data
I agree with this in principle; however, I think this may be too
generic a statement for wave. I would like to understand exactly what
our usage scenarios are. Are we really aiming to have a document on my
mobile device that I share with you over Bluetooth so we can
collaborate? Is this a requirement? It may be, but the exact use cases
we target may heavily impact the tradeoffs we are willing to make. It's
easy to say pure P2P, but let's think it through.

>- We can collaboratively edit tree structures, key-value pairs and
>rich text. (I'd like arbitrary JSON editing)
I would extend this to Object Structures (which can be viewed as tree-like).

>- Fast enough to work on mobile devices. Protocol needs to be low
>bandwidth and require a low number of network round-trips.
>- Ideally a low storage overhead
>- End-to-end encryption of content. (In the wake of PRISM, the less
>information my wave server needs to know about me, the better.)
>- Some wave version of OTR[1]
>- Presence & cursors
>- Simple. At least, as simple as we can manage. Let's at least try :/
>Is there anything else I'm missing? Some of this stuff is only
>tangentially related to OT (like cursors and encryption), but it's
>still important that we bring it up.
>Given that we want to be able to connect arbitrary peers, I think it
>makes sense to use a message encryption & signing system. Currently,
>wave servers sign their users' messages. To support users
>collaborating over a LAN without servers, we should allow users to
>sign (& encrypt) their own messages. Unfortunately, the signature will
>become invalid if the operation is transformed.
There is a paper on this I will try to find tonight.

>Consider peers A and B (communicating over a LAN), and disconnected
>peer C. A and B each generate simultaneous operations and sign their
>operations. A sends its operation to B, who transforms it. B then
>connects to C and sends both operations. For C to verify A's
>operation, C must see the original (untransformed) version of the
>operation. I think the only way we can make this work securely is by
>only sending original operations instead of sending operations after
>they've been transformed.
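
To make the signing constraint concrete, here is a minimal sketch in
JavaScript. A SHA-256 hash stands in for a real public-key signature,
and the op layout is made up, but it shows why C must see A's original
bytes:

    // A hash stands in for a real signature over the op's original bytes.
    const crypto = require('crypto');

    function sign(opBytes) {
      // stand-in for privateKey.sign(opBytes) in a real system
      return crypto.createHash('sha256').update(opBytes).digest('hex');
    }
    function verify(opBytes, sig) {
      return sign(opBytes) === sig;
    }

    // A's op, byte-for-byte as A signed it:
    const originalOp = '{"type":"ins","pos":5,"text":"x"}';
    const sig = sign(originalOp);
    console.log(verify(originalOp, sig));    // true: C can check A's original

    // B's transformed copy is different bytes, so the signature breaks:
    const transformedOp = '{"type":"ins","pos":8,"text":"x"}';
    console.log(verify(transformedOp, sig)); // false
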
>As far as I know, there are three broad directions we could take:
>1. CRDTs (like Logoot[1])
For CRDTs, look at this paper (it's a little more recent):

>2. 'Classical' OT using vector clocks for versioning.
>3. The OT system that Torben Weis was working on. (Some might remember
>him from the wave summit a few years ago). I have been chatting on and
>off with him about this since the summit. He hasn't published anything
>on it yet though. It's similar to classical OT, except using git-style
>version hashes.
>CRDTs are probably the most elegant solution. They avoid the entire
>complicated mess of transformation. However:
>- They add a significant overhead to the document size (we must store
>a unique name between every two characters in the wave)
>- As far as I know, nobody has implemented arbitrary JSON editing
>using CRDTs. Implementing a set is very difficult.
>- Even without transform, we'll still have to store operations anyway
>so peers can verify signatures.
>I'd like to know what we should expect the actual document size
>overhead to be. I expect it to be at least a 5x size increase, but I
>don't know for sure. The papers I've read cleverly minimize the
>overhead by only doing OT on a line-by-line basis, but that's
>unacceptable for wave.
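
To illustrate where that overhead comes from, here's a rough, made-up
sketch of Logoot-style storage, with position identifiers simplified to
a single [digit, siteId] pair:

    // Every character carries a unique, totally ordered identifier, so
    // the overhead is per character. (Identifiers greatly simplified.)
    const doc = [
      { id: [[16, 1]], ch: 'h' },
      { id: [[32, 1]], ch: 'i' },
    ];

    // Inserting between 'h' and 'i' allocates an id strictly between them:
    doc.splice(1, 0, { id: [[24, 2]], ch: '!' });

    // The visible text is 3 bytes; what we actually store is far larger:
    console.log(doc.map(e => e.ch).join(''));                 // "h!i"
    console.log(JSON.stringify(doc).length + ' bytes stored'); // vs 3 visible
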
>As I understand it, vector clocks are the size of the number of unique
>peers which have ever connected to the network.
Technically it's only peers that have generated operations, rather than
ones that have just connected.

> (They contain an
>integer version number per peer on the wave). As I understand them, a
>vector clock must be stored in each operation -
Yep.  Overhead.
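
For anyone following along, a sketch of what that per-op overhead looks
like, and the causality check it buys you (structure is illustrative):

    // A vector clock is one integer per peer that has generated ops:
    // {peerId: count}. a happened-before b iff every entry of a is <=
    // the matching entry of b, and at least one is strictly less.
    function happenedBefore(a, b) {
      const peers = new Set([...Object.keys(a), ...Object.keys(b)]);
      let strictlyLess = false;
      for (const p of peers) {
        const av = a[p] || 0, bv = b[p] || 0;
        if (av > bv) return false;
        if (av < bv) strictlyLess = true;
      }
      return strictlyLess;
    }

    const opA = { alice: 3, bob: 1 };      // stored with EVERY op: the overhead
    const opB = { alice: 3, bob: 2 };
    console.log(happenedBefore(opA, opB)); // true
    console.log(happenedBefore(opB, opA)); // false (both false => concurrent)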

>which to me always
>seemed unpalatable enough that I didn't explore them further. Note that
>this is per *peer*, not per user. If we want full p2p, this is one per
>device which has ever edited a wave.
>There's probably some clever ways we could minimize that size. Off the
>top of my head:
>- Most operations are generated on the same site as their single
>parent. These operations only increment the local version number. We
>could special case this. I don't know how much better this would make
>the system - I suspect a lot.
>- If we want, we could build a hybrid system where most clients use a
>client-server protocol like the current system. Servers could then use
>vector clocks or whatever to confer amongst themselves. This way we
>only need one element in the clock per server. However, we lose most
>of the benefits of having a p2p OT system to begin with.
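
On the first idea above (ops that only bump the local counter), a sketch
of what that special case could look like on the wire. The encoding is
made up, not a spec:

    // Most ops just increment the sender's own entry; those can be sent
    // as a flag instead of a full clock.
    function encodeClock(clock, siteId, prevClock) {
      const bumped = Object.assign({}, prevClock);
      bumped[siteId] = (bumped[siteId] || 0) + 1;
      const same = Object.keys(Object.assign({}, clock, bumped))
        .every(k => (clock[k] || 0) === (bumped[k] || 0));
      return same ? { localBump: true }   // tiny: "same as my last op, +1"
                  : { full: clock };      // rare: ship the whole clock
    }

    console.log(encodeClock({ a: 4, b: 2 }, 'a', { a: 3, b: 2 }));
    // => { localBump: true }
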
One proposed approach is the super-node concept: neighborhoods of fully
connected peers use vector clocks among themselves, and then hierarchies
of super-peers do the same. This keeps the size of the vector clocks at
any given level small, albeit with added complexity. That said, quite a
few P2P networks take this approach to deal with scalability issues in
general.
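
Sketching that with invented numbers:

    // Two-level clocks under the super-node idea (illustrative only):
    // a peer's clock names only its neighborhood, and the super-peer
    // layer keeps its own clock, so no single clock grows with the
    // total number of peers in the network.
    const leafClock  = { deviceA: 4, deviceB: 2 };        // one neighborhood
    const superClock = { superPeer1: 17, superPeer2: 9 }; // super layer
    console.log(Object.keys(leafClock).length +
                Object.keys(superClock).length);          // stays small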

>Torben's system is the one I know the most about. I'm biased to think
>that's the right answer because I understand it better than the others.
>I just spent the last couple hours writing up how it works, but I'm
>going to run it by Torben before posting to this list.
> It works like this:
>First, we need a transform function that supports TP2.
A small grammatical clarification: a transformation function is one
rule for transforming two specific operations. An OT system is made up
of MANY transformation functions, not just one. So an insert vs. a
delete and a delete vs. an update would be two separate functions.
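
To illustrate the distinction with made-up names, a text system might
have one function per pair of operation types, plus a dispatcher:

    // One transform rule per pair of op types (names are invented):
    function transformInsIns(a, b) { // a rebased over concurrent insert b
      // ties at the same position need a site-id tiebreak in a real system
      return b.pos <= a.pos
        ? { type: 'ins', pos: a.pos + b.text.length, text: a.text } : a;
    }
    function transformInsDel(a, b) { // a rebased over a one-char delete b
      return b.pos < a.pos
        ? { type: 'ins', pos: a.pos - 1, text: a.text } : a;
    }
    // ...plus del/ins, del/del, update/..., one per combination.
    function transform(a, b) {
      if (a.type === 'ins' && b.type === 'ins') return transformInsIns(a, b);
      if (a.type === 'ins' && b.type === 'del') return transformInsDel(a, b);
      throw new Error('no rule for ' + a.type + ' vs ' + b.type);
    }

    console.log(transform({ type: 'ins', pos: 5, text: 'x' },
                          { type: 'del', pos: 2 }));  // => pos 4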

>The simplest
>way to do this is to add tombstones to the data type. (We'll need to
>do this for the vector clock based system as well, by the way).
I disagree. In the AnyUndo algorithm the transformation functions
support IP1, CP1/TP1, and CP2/TP2. It uses a vector clock, and I don't
believe it uses tombstones.
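
For readers who haven't seen them, the tombstone idea Joseph mentions
looks roughly like this:

    // Deleted characters stay in the data type as invisible markers, so
    // concurrent ops that still reference their positions can transform.
    const doc = [
      { ch: 'h', dead: false },
      { ch: 'x', dead: true },    // deleted, kept as a tombstone
      { ch: 'i', dead: false },
    ];
    console.log(doc.filter(c => !c.dead).map(c => c.ch).join('')); // "hi"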

> I have
>a plaintext implementation which does this in javascript[3], although
>it's quite slow (~150k transforms per second). I expect that an
>optimized version of this will be around 10-20 million transforms /
>second in C[4] for simple ops, and probably a few million per second
>in java.
>Second, make an inverse transform function (prune). P(T(a, b), b) = a.
>In its current incantation, the algorithm works as follows:
>Consider a DAG of all the operations. The closest metaphor is a
>network of operations in git. Every operation has a unique hash
>identifier and one or more parents (except the root, which has no
>parents). It'll look something like this:
>... except it'll usually be waaaay less complicated than that.
>The cool thing about this structure is that if you apply all of the
>operations in that graph in *any order*, you'll end up in the same
>document state. (So long as each node comes after all its parents and
>you use the transform function correctly).
>Each peer stores a history list of operations. Each operation is
>stored twice. One copy is its original form (the exact bytes which
>were used for hashing and signing). When sending the operation to a
>remote peer, this is the copy we send. The second copy of the op is
>transformed. At all times, the transformed ops in the history list are
>written such that you could directly apply every operation in the
>history list (in order) to a new document and you would reach the
>current document state. The history lists on different nodes don't
>have to be in the same order.
>We also store a copy of the current document snapshot, because it's
>really useful for UI.
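
As I read it, one history-list entry carries something like this (the
field names are my guess, not Joseph's format):

    const entry = {
      id: 'a3f9...',            // hash of the original bytes
      parents: ['90cc...'],     // the frontier when the op was created
      ancestorCount: 2,         // history length at creation; see below
      original: '{"type":"ins","pos":5,"text":"x"}',   // signed; sent as-is
      transformed: { type: 'ins', pos: 8, text: 'x' }  // applies in OUR order
    };
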
>Consider the DAG again. Every operation in the history list is a node
>in that DAG, with its parents as other nodes. The DAG will have at
>least one node with no children (= the bottom nodes in that diagram I
>linked above). These childless nodes form the *frontier*. When a new
>operation is created, all ops in the frontier become the new node's
>parents. The new node also stores the number of ancestors it has (=
>history list length). I'll get to why that's useful in a moment.
>The history list has a special power because of the prune function:
>operations which don't depend on one another can be swapped in place.
>swap = (a, b) ->
>  b' = prune(b, a)
>  a' = transform(a, b')
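
Here's a tiny runnable sketch of prune and swap for insert-only ops
(ties broken by position alone; a real system also needs site-id
tiebreaks plus delete/tombstone handling for TP2):

    function transform(a, b) { // shift a right if b landed at or before it
      return b.pos <= a.pos ? { pos: a.pos + 1, ch: a.ch } : a;
    }
    function prune(a, b) {     // undo transform-by-b: P(T(a, b), b) == a
      return b.pos < a.pos ? { pos: a.pos - 1, ch: a.ch } : a;
    }
    function swap(a, b) {      // [a, b] -> [b', a'] over the same document
      const b1 = prune(b, a);          // b as if a had never happened
      const a1 = transform(a, b1);     // a moved to after b'
      return [b1, a1];
    }

    const a = { pos: 2, ch: 'X' };
    const b = { pos: 5, ch: 'Y' };
    const bAfterA = transform(b, a);            // b in a's wake: pos 6
    console.log(prune(bAfterA, a));             // { pos: 5, ch: 'Y' } == b

    function apply(doc, op) {
      return doc.slice(0, op.pos) + op.ch + doc.slice(op.pos);
    }
    const viaAB = apply(apply('abcdefg', a), bAfterA);
    const [b1, a1] = swap(a, bAfterA);
    const viaBA = apply(apply('abcdefg', b1), a1);
    console.log(viaAB === viaBA);               // true: same final doc
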
>When peers connect, they send each other missing ops. Figuring out
>which ops are missing can be surprisingly tricky - but we'll figure
>that out later. New ops must be ingested in order, so we always ingest
>an operation after ingesting all of its parents.
>So you're given an op from another peer to ingest. We need to
>transform that operation by all operations we have locally that aren't
>in the op's ancestors. You can't just arbitrarily transform two ops by
>one another - they both have to be applicable to the same document.
>So what we do is rewrite our local history so that all of the
>operation's ancestors are at the start of our history list, and all of
>the other operations are at the end of the history. Once we've done
>that, the operation can be transformed by all the ops in the second
>half of our history list and applied (and appended at the end of the
>history list).
I think I need to see an example of this to really grasp it. It sounds
interesting but also complex. This point and the "which operations to
send" question seem to be the main sticking points. These are not show
stoppers; just for me, I need to investigate a little more.

>Remember, we know how many ancestors the new operation has. We saved
>that earlier when the op was created. We also have all of that
>operation's parents in our history list. So we find the position of
>the last parent the op has in the history list. If that operation is
>at the same position as the new op's ancestor length, we're golden.
>Imagine your history list contains:
>A B X Y
>and the peer you're connecting to has:
>A B C
>(C has B as a parent). C has 2 ancestors (A and B). When you ingest C,
>you find B (its last parent), which is at position 2 in the history
>list. So you know that everything after that in the history list isn't
>one of C's parents, and you can transform C by X, Y and then apply it.
>If our history list instead looks like this:
>X A B Y
>... then you'll find B is too late in your history list. You need to
>rewrite your history list to move the X after B. You use the swap
>function to do that, swapping X and A, then X and B. Your history then
>looks like A B X Y like we wanted, so you can proceed as above.
Yes, my fear here is that in a true P2P system we keep rewriting the
history constantly. I suppose your performance testing can mitigate that
concern.
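
Same here. To check my own understanding, a rough insert-only sketch of
the rewrite on the "X A B Y" example (entry layout and helper names are
mine, not from Joseph's implementation):

    function transform(a, b) {
      return b.pos <= a.pos ? { pos: a.pos + 1, ch: a.ch } : a;
    }
    function prune(a, b) {
      return b.pos < a.pos ? { pos: a.pos - 1, ch: a.ch } : a;
    }

    // Bubble non-ancestors to the right of ancestors, fixing ops up with
    // prune/transform at every adjacent swap; then transform the new op
    // over the non-ancestor suffix and append it.
    function ingest(history, incoming) {
      const anc = new Set(incoming.ancestorIds); // ids of ALL its ancestors
      for (let i = 1; i < history.length; i++) {
        let j = i;
        while (j > 0 && !anc.has(history[j - 1].id) && anc.has(history[j].id)) {
          const left = history[j - 1], right = history[j];
          const rightOp = prune(right.op, left.op);    // as if left never happened
          const leftOp  = transform(left.op, rightOp); // replayed after right
          history[j - 1] = { id: right.id, op: rightOp };
          history[j]     = { id: left.id,  op: leftOp };
          j--;
        }
      }
      let op = incoming.op;
      for (let k = incoming.ancestorIds.length; k < history.length; k++) {
        op = transform(op, history[k].op);             // rebase over X, Y, ...
      }
      history.push({ id: incoming.id, op: op });
      return op;                                       // apply to the snapshot
    }

    // "X A B Y" receiving C (ancestors A and B):
    const history = [
      { id: 'X', op: { pos: 0, ch: 'x' } },
      { id: 'A', op: { pos: 3, ch: 'a' } },
      { id: 'B', op: { pos: 4, ch: 'b' } },
      { id: 'Y', op: { pos: 6, ch: 'y' } },
    ];
    ingest(history, { id: 'C', ancestorIds: ['A', 'B'], op: { pos: 5, ch: 'c' } });
    console.log(history.map(e => e.id).join(' '));     // "A B X Y C"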

>This algorithm requires a lot of transforms, but I think most of them
>are unavoidable because of the signing / trust issue. It also requires
>that every operation (i.e., every key press) has a unique ID. That's
>probably going to be a 16-byte hash or something, which seems like a
>lot of extra bytes for typing one character.
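
For scale, the id could be something like this (the truncation and
layout are my guess):

    const crypto = require('crypto');
    // A git-style op id: hash the original bytes plus the parent ids,
    // truncated to 16 bytes. This travels with every keypress-sized op.
    function opId(originalBytes, parentIds) {
      return crypto.createHash('sha256')
        .update(parentIds.join(','))
        .update(originalBytes)
        .digest('hex')
        .slice(0, 32);               // 32 hex chars = 16 bytes
    }
    console.log(opId('{"type":"ins","pos":5,"text":"x"}', []).length); // 32
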
>I implemented it here:
>... though with the randomizer setup I have (a handful of peers, all
>randomly generating operations and periodically (randomly) connecting
>to each other to sync up):
>5 clients generating operations, with two random clients syncing about
>every 5 operations, results in roughly 1000x as many transforms as
>operations (10k ops -> 11M transforms). Doing the same thing with 3
>clients 'only' needs 2M transforms. Performance stays steady over time
>(it doesn't get worse as the document grows like CRDTs will) - this
>scatterplot shows the number of seconds per 30 iterations on one of my
>I think this huge number of transforms required is one of the costs of
>doing encryption & signing of the ops. We might be able to lower the
>number of transforms required if we trust our servers with our keys -
>but I'm not sure I want to do that. Real-world performance should be
>much better than that anyway, but yeah - it makes me nervous. I'm
>probably going to run more tests to see what kind of performance we
>can expect in practice.
>Blah. That was a lot of words. Thoughts?

In general, I have a few more thoughts. For corporate uses of this sort
of system, the ability to have a single canonical version of something
is very important. The playback feature of wave, which showed how a
document evolved over time, was also very important. That is easy to do
when you have an authoritative server, and not so easy when each client
has its own history. Likewise, saving a document to some permanent
storage raises challenges. With wave, again, it's easy: the server does
it. Here we would be talking about taking a global snapshot across
peers. There are many algorithms that can help do this, but they may or
may not give the end user what they want (for example, if a peer is
disconnected, do their edits count? Do you have to wait until they come
back to save?).

In essence, what I am getting at is that we should 100% explore this
system to understand it. But I really think we need to ask the question:
what are the requirements? It MAY be that we need to come up with a
system that shares the same set of operations and messaging but works
differently for different use cases. Or we need to agree on a common set
of use cases to build toward.

So in no way am I voting against this system. It's one of the more
interesting ideas I have seen as of late. I just want to make sure we
know what we are aiming for.

>[4] based on my extremely fast non-tombstone equivalent data structure
