incubator-wave-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joseph Gentle <jose...@gmail.com>
Subject Re: Delving into OT
Date Tue, 25 Jun 2013 02:20:20 GMT
Yep, probably right. - sounds good.

On Mon, Jun 24, 2013 at 5:53 PM, Michael MacFadden
<michael.macfadden@gmail.com> wrote:
> I will respond to this section by section in a bit.
>
> However my general comment is that there is a lot to digest here and I think we need
to start putting this stuff in a wiki page(s). I think it will get buried here. I think we
might need individual pages on each protocol idea. And the a pro/con page where we compare
them based on discussions on the list.
>
> ~Michael
>
> On Jun 24, 2013, at 1:06 PM, Joseph Gentle <josephg@gmail.com> wrote:
>
>> I want to start the discussion around what OT algorithms to use. This
>> is going to get technical. This is not a bug. Please ask simple but
>> tangential questions (like how does wave's current crypto system
>> work?) in a new thread instead of this one.
>>
>> Requirements:
>> - Arbitrary peers can syncronize data
>> - We can collaboratively edit tree structures, key-value pairs and
>> rich text. (I'd like arbitrary JSON editing)
>> - Fast enough to work on mobile devices. Protocol needs to be low
>> bandwidth and require a low number of network round-trips.
>> - Ideally a low storage overhead
>>
>> Nice-to-have:
>> - End-to-end encryption of content. (In the wake of PRISM, the less
>> information my wave server needs to know about me, the better.)
>> - Some wave version of OTR[1]
>> - Presence & cursors
>> - Simple. At least, as simple as we can manage. Lets at least try :/
>>
>> Is there anything else I'm missing? Some of this stuff is only
>> tangentially related to OT (like cursors and encryption), but its
>> still important we bring it up.
>>
>>
>> Given that we want to be able to connect arbitrary peers, I think it
>> makes sense to use a message encryption & signing system. Currently,
>> wave servers sign their user's messages. To support users
>> collaborating over a LAN without servers, we should allow users to
>> sign (& encrypt) their own messages. Unfortunately, the signature will
>> become invalid if the operation is transformed.
>>
>> Consider peers A and B (communicating over a LAN), and disconnected
>> peer C. A and B each generate simultaneous operations and sign their
>> operations. A sends its operation to B, who transforms it. B then
>> connects to C and sends both operations. For C to verify A's
>> operation, C must see the original (untransformed) version of the
>> operation. I think the only way we can make this work securely is by
>> only sending original operations instead of sending operations after
>> they've been transformed.
>>
>> As far as I know, there's three broad directions we could take algorithmically:
>> 1. CRDTs (like Logoot[1])
>> 2. 'Classical' OT using vector clocks for versioning.
>> 3. The OT system that Torben Weis was working on. (Some might remember
>> him from the wave summit a few years ago). I have been chatting on and
>> off with him about this since the summit. He hasn't published anything
>> on it yet though. Its similar to classical OT, except using git style
>> version hashes.
>>
>>
>> CRDTs are probably the most elegant solution. They avoid the entire
>> complicated mess of transformation. However:
>> - They add a significant overhead to the document size (we must store
>> a unique name between every two characters in the wave)
>> - As far as I know, nobody has implemented arbitrary JSON editing
>> using CRDTs. Implementing a set is very difficult.
>> - Even without transform, we'll still have to store operations anyway
>> so peers can verify signatures.
>>
>> I'd like to know what we should expect the actual document size
>> overhead to be. I expect it to be at least a 5x size increase, but I
>> don't know for sure. The papers I've read cleverly minimize the
>> overhead by only doing OT on a line-by-line basis, but thats
>> unacceptable for wave.
>>
>>
>> As I understand it, vector clocks are the size of the number of unique
>> peers which have ever connected to the network. (They contain an
>> integer version number per peer on the wave). As I understand them, a
>> vector clock must be stored in each operation - which to me always
>> seemed unpalatable enough I didn't explore them further. Note that
>> this is per *peer*, not per user. If we want full p2p, this is one per
>> device which has ever edited a wave.
>>
>> There's probably some clever ways we could minimize that size. Off the
>> top of my head:
>> - Most operations are generated on the same site as their single
>> parent. These operations only increment the local version number. We
>> could special case this. I don't know how much better this would make
>> the system - I suspect a lot.
>> - If we want, we could build a hybrid system where most clients use a
>> client-server protocol like the current system. Servers could then use
>> vector clocks or whatever to confer amongst themselves. This way we
>> only need one element in the clock per server. However, we lose most
>> of the benefits of having a p2p OT system to begin with.
>>
>>
>> Torben's system is the one I know the most about. I'm biased to think
>> thats the right answer because I understand it better than the others.
>> I just spent the last couple hours writing up how it works, but I'm
>> going to run it by Torben before posting to this list.
>>
>> It works like this:
>>
>> First, we need a transform function that supports TP2. The simplest
>> way to do this is to add tombstones to the data type. (We'll need to
>> do this for the vector clock based system as well, by the way). I have
>> a plaintext implementation which does this in javascript[3], although
>> its quite slow (~150k transforms per second). I expect that an
>> optimized version of this will be around 10-20 million transforms /
>> second in C[4] for simple ops, and probably a few million per second
>> in java.
>>
>> Second, make an inverse transform function (prune). P(T(a, b), b) = a.
>>
>> In its current incantation, the algorithm works as follows:
>> Consider a DAG of all the operations. The closest metaphor is a
>> network of operations in git. Every operation has a unique hash
>> identifier and one or more parents (except the root, which has no
>> parents). It'll look something like this:
>> https://dl.dropboxusercontent.com/u/2494815/ops.png
>> ... except it'll usually be waaaay less complicated than that.
>>
>> The cool thing about this structure is that if you apply all of the
>> operations in that graph in *any order*, you'll end up in the same
>> document state. (So long as each node comes after all its parents and
>> you use the transform function correctly).
>>
>> Each peer stores a history list of operations. Each operation is
>> stored twice. One copy is its original form (the exact bytes which
>> were used for hashing and signing). When sending the operation to a
>> remote peer, this is the copy we send. The second copy of the op is
>> transformed. At all times, the transformed ops in the history list are
>> written such that you could directly apply every operation in the
>> history list (in order) to a new document and you would reach the
>> current document state. The history lists on different nodes don't
>> have to be in the same order.
>>
>> We also store a copy of the current document snapshot, because its
>> really useful for UI.
>>
>> Consider the DAG again. Every operation in the history list is a node
>> in that DAG, with its parents as other nodes. The DAG will have at
>> least one node with no children (= the bottom nodes in that diagram I
>> linked above). These childless nodes form the *frontier*. When a new
>> operation is created, all ops in the frontier become the new node's
>> parents. The new node also stores the number of ancestors it has (=
>> history list length). I'll get to why thats useful that in a moment.
>>
>> The history list has a special power because of the prune function:
>> operations which don't depend on one another can be swapped in place.
>>
>> swap = (a, b) ->
>>  b' = prune(b, a)
>>  a' = transform(a, b)
>>
>>
>> When peers connect, they send each other missing ops. Figuring out
>> which ops are missing can be surprisingly tricky - but we'll figure
>> that out later. New ops must be ingested in order, so we always ingest
>> an operation after ingesting all of its parents.
>>
>> So you're given an op from another peer to ingest. We need to
>> transform that operation by all operations we have locally that aren't
>> in the op's ancestors. You can't just arbitrarily transform two ops by
>> one another - they have to be both be applyable to the same document.
>> So what we do is rewrite our local history so that all of the
>> operation's ancestors are at the start of our history list, and all of
>> the other operations are at the end of the history. Once we've done
>> that, the operation can be transformed by all the ops in the second
>> half of our history list and applied (and appended at the end of the
>> history list).
>>
>> Remember, we know how many ancestors the new operation has. We saved
>> that earlier when the op was created. We also have all of that
>> operation's parents in our history list. So we find the position of
>> the last parent the op has in the history list. If that operation is
>> at the same position as the new op's ancestor length, we're golden.
>>
>> Imagine your history list contains:
>> A B X Y
>>
>> and the peer you're connecting to has:
>> A B C
>>
>> (C has B as a parent). C has 2 ancestors (A and B). When you ingest C,
>> you find B (its last parent), which is at position 2 in the history
>> list. So you know that everything after that in the history list isn't
>> one of C's parents, and you can transform C by X, Y and then apply it.
>>
>> If our history list instead looks like this:
>>
>> X A B Y
>> ... then you'll find B is too late in your history list. You need to
>> rewrite your history list to move the X after B. You use the swap
>> function to do that, swapping X and A, then X and B. Your history then
>> looks like A B X Y like we wanted, so you can proceed as above.
>>
>> This algorithm requires a lot of transforms, but I think most of them
>> are unavoidable because of the signing / trust issue. It also requires
>> that every operation (ie, every key press) has a unique ID. Thats
>> probably going to be a 16 byte hash or something; which seems like a
>> lot of extra bytes for typing one character.
>>
>> I implemented it here:
>> https://github.com/josephg/tp2stuff/blob/master/node2.coffee
>>
>> ... though with the randomizer setup I have (a handful of peers, all
>> randomly generating operations and periodically (randomly) connecting
>> to each other to sync up.
>>
>> 5 clients generating operations, with two random clients syncing about
>> every 5 operations results in about 100x as many transforms as
>> operations. (10k ops -> 11M transforms). Doing the same thing with 3
>> clients 'only' needs 2M transforms. Performance stays steady over time
>> (it doesn't get worse as the document grows like CRDTs will) - this
>> scatterplot shows the number of seconds per 30 iterations on one of my
>> tests:
>> https://dl.dropboxusercontent.com/u/2494815/Screen%20Shot%202013-06-24%20at%201.01.42%20PM.png
>>
>> I think this huge number of transforms required is one of the costs of
>> doing encryption & signing of the ops. We might be able to lower the
>> number of transforms required if we trust our servers with our keys -
>> but I'm not sure I want to do that. Real-world performance should be
>> much better than that anyway, but yeah - it makes me nervous. I'm
>> probably going to run more tests to see what kind of performance we
>> can expect in practice.
>>
>>
>> Blah. That was a lot of words. Thoughts?
>> -J
>>
>>
>> [1] http://www.cypherpunks.ca/otr/
>> [2] http://hal.archives-ouvertes.fr/docs/00/34/59/11/PDF/main.pdf
>> [3] https://github.com/share/ottypes/blob/master/lib/text-tp2.js
>> [4] based on my extremely fast non-tombstone equivalent data structure
>> here: https://github.com/share/libot/blob/master/text.h

Mime
View raw message