incubator-wave-dev mailing list archives

From "John Blossom - Shore Communications Inc." <jblos...@shore.com>
Subject Re: Delving into OT
Date Thu, 27 Jun 2013 12:08:04 GMT
Sounds like we need a Hangout with an online whiteboard or such to work
this out...? John

On Tue, Jun 25, 2013 at 4:53 PM, Sam Nelson <sohra@orcon.net.nz> wrote:

> +1 for structured wiki pages.
>
> I've re-read the original email a few times now and I still get a little
> bit lost around the ancestor-count part. Compare history lists A B C and
> D B X Y: when trying to apply X, the ancestor count will match both, so
> how does that work? X has B as a parent just like C has B as a parent,
> and both C and X have an ancestor count of 2. Or have I mistaken this,
> in that it should be impossible for two peers to have operation B
> without having identical history lists up to that point?
>
> I'll keep going through your tp2stuff as I have time, and eventually I
> might be able to contribute intelligently in some way (:
>
> -Sam
>
>
>
> On 25/06/2013 14:20, Joseph Gentle wrote:
>
>> Yep, probably right - sounds good.
>>
>> On Mon, Jun 24, 2013 at 5:53 PM, Michael MacFadden
>> <michael.macfadden@gmail.com> wrote:
>>
>>> I will respond to this section by section in a bit.
>>>
>>> However, my general comment is that there is a lot to digest here and I
>>> think we need to start putting this stuff in wiki pages. I think it
>>> will get buried here. I think we might need individual pages on each
>>> protocol idea, and then a pro/con page where we compare them based on
>>> discussions on the list.
>>>
>>> ~Michael
>>>
>>> On Jun 24, 2013, at 1:06 PM, Joseph Gentle <josephg@gmail.com> wrote:
>>>
>>>> I want to start the discussion around what OT algorithms to use. This
>>>> is going to get technical. This is not a bug. Please ask simple but
>>>> tangential questions (like how does wave's current crypto system
>>>> work?) in a new thread instead of this one.
>>>>
>>>> Requirements:
>>>> - Arbitrary peers can synchronize data
>>>> - We can collaboratively edit tree structures, key-value pairs and
>>>> rich text. (I'd like arbitrary JSON editing)
>>>> - Fast enough to work on mobile devices. Protocol needs to be low
>>>> bandwidth and require a low number of network round-trips.
>>>> - Ideally a low storage overhead
>>>>
>>>> Nice-to-have:
>>>> - End-to-end encryption of content. (In the wake of PRISM, the less
>>>> information my wave server needs to know about me, the better.)
>>>> - Some wave version of OTR[1]
>>>> - Presence & cursors
>>>> - Simple. At least, as simple as we can manage. Let's at least try :/
>>>>
>>>> Is there anything else I'm missing? Some of this stuff is only
>>>> tangentially related to OT (like cursors and encryption), but it's
>>>> still important that we bring it up.
>>>>
>>>>
>>>> Given that we want to be able to connect arbitrary peers, I think it
>>>> makes sense to use a message encryption & signing system. Currently,
>>>> wave servers sign their users' messages. To support users
>>>> collaborating over a LAN without servers, we should allow users to
>>>> sign (& encrypt) their own messages. Unfortunately, the signature will
>>>> become invalid if the operation is transformed.
>>>>
>>>> Consider peers A and B (communicating over a LAN), and disconnected
>>>> peer C. A and B each generate simultaneous operations and sign their
>>>> operations. A sends its operation to B, who transforms it. B then
>>>> connects to C and sends both operations. For C to verify A's
>>>> operation, C must see the original (untransformed) version of the
>>>> operation. I think the only way we can make this work securely is by
>>>> only sending original operations instead of sending operations after
>>>> they've been transformed.
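>>>>
>>>> Roughly, each op would travel as a signed envelope, and we'd only ever
>>>> verify (and forward) the original bytes. A sketch - sign/verify are
>>>> placeholders for whatever crypto primitives we end up picking:
>>>>
>>>> # sign/verify are stand-ins, not a real crypto library
>>>> makeEnvelope = (op, parents, author, privateKey) ->
>>>>   payload = JSON.stringify {op, parents, author}
>>>>   {payload, sig: sign(payload, privateKey)}
>>>>
>>>> receiveEnvelope = (env, publicKeys) ->
>>>>   {op, parents, author} = JSON.parse env.payload
>>>>   unless verify env.payload, env.sig, publicKeys[author]
>>>>     throw new Error 'bad signature'
>>>>   # transform a *copy* of op locally; keep env verbatim so we can
>>>>   # forward the original, verifiable bytes to other peers
>>>>   {op, parents, author, original: env}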
>>>>
>>>> As far as I know, there are three broad directions we could take
>>>> algorithmically:
>>>> 1. CRDTs (like Logoot[2])
>>>> 2. 'Classical' OT using vector clocks for versioning.
>>>> 3. The OT system that Torben Weis was working on. (Some might remember
>>>> him from the wave summit a few years ago.) I have been chatting on and
>>>> off with him about this since the summit. He hasn't published anything
>>>> on it yet, though. It's similar to classical OT, except it uses
>>>> git-style version hashes.
>>>>
>>>>
>>>> CRDTs are probably the most elegant solution. They avoid the entire
>>>> complicated mess of transformation. However:
>>>> - They add a significant overhead to the document size (we must store
>>>> a unique name between every two characters in the wave)
>>>> - As far as I know, nobody has implemented arbitrary JSON editing
>>>> using CRDTs. Implementing a set is very difficult.
>>>> - Even without transform, we'll still have to store operations anyway
>>>> so peers can verify signatures.
>>>>
>>>> I'd like to know what we should expect the actual document size
>>>> overhead to be. I expect it to be at least a 5x size increase, but I
>>>> don't know for sure. The papers I've read cleverly minimize the
>>>> overhead by only doing OT on a line-by-line basis, but that's
>>>> unacceptable for wave.
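>>>>
>>>> To make that guess concrete: a Logoot-style position is a list of
>>>> (digit, site, counter) triples, one position per character. Purely
>>>> illustrative numbers (not Logoot's actual encoding):
>>>>
>>>> # 4-byte digit + 4-byte site id + 4-byte counter per level
>>>> posBytes = (depth) -> depth * (4 + 4 + 4)
>>>> console.log posBytes(1)  # 12 bytes of id per 1-byte character
>>>> console.log posBytes(3)  # 36 bytes once positions get nested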
>>>>
>>>>
>>>> As I understand it, vector clocks grow with the number of unique
>>>> peers which have ever connected to the network. (They contain an
>>>> integer version number per peer on the wave.) A vector clock must be
>>>> stored in each operation - which always seemed unpalatable enough to
>>>> me that I didn't explore them further. Note that this is per *peer*,
>>>> not per user. If we want full p2p, this is one entry per device which
>>>> has ever edited a wave.
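>>>>
>>>> For reference, a minimal vector clock sketch - note the clock carries
>>>> one entry per peer that has *ever* touched the wave, which is exactly
>>>> the size problem:
>>>>
>>>> increment = (clock, peerId) ->
>>>>   copy = {}
>>>>   copy[k] = v for k, v of clock
>>>>   copy[peerId] = (copy[peerId] ? 0) + 1
>>>>   copy
>>>>
>>>> # a happened before b iff a <= b componentwise and a != b
>>>> happenedBefore = (a, b) ->
>>>>   peers = Object.keys(a).concat Object.keys(b)
>>>>   le = peers.every (p) -> (a[p] ? 0) <= (b[p] ? 0)
>>>>   lt = peers.some (p) -> (a[p] ? 0) < (b[p] ? 0)
>>>>   le and lt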
>>>>
>>>> There are probably some clever ways we could minimize that size. Off
>>>> the top of my head:
>>>> - Most operations are generated on the same site as their single
>>>> parent. These operations only increment the local version number. We
>>>> could special case this. I don't know how much better this would make
>>>> the system - I suspect a lot.
>>>> - If we want, we could build a hybrid system where most clients use a
>>>> client-server protocol like the current system. Servers could then use
>>>> vector clocks or whatever to confer amongst themselves. This way we
>>>> only need one element in the clock per server. However, we lose most
>>>> of the benefits of having a p2p OT system to begin with.
>>>>
>>>>
>>>> Torben's system is the one I know the most about. I'm biased to think
>>>> that's the right answer because I understand it better than the others.
>>>> I just spent the last couple of hours writing up how it works, but I'm
>>>> going to run it by Torben before posting it to this list.
>>>>
>>>> It works like this:
>>>>
>>>> First, we need a transform function that supports TP2. The simplest
>>>> way to do this is to add tombstones to the data type. (We'll need to
>>>> do this for the vector clock based system as well, by the way.) I have
>>>> a plaintext implementation which does this in javascript[3], although
>>>> it's quite slow (~150k transforms per second). I expect that an
>>>> optimized version of this will be around 10-20 million transforms /
>>>> second in C[4] for simple ops, and probably a few million per second
>>>> in java.
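>>>>
>>>> The tombstone idea in a nutshell (an illustrative representation, not
>>>> the actual text-tp2 format): deleted characters stay in the document
>>>> as invisible placeholders, so concurrent ops never disagree about
>>>> positions.
>>>>
>>>> # runs of visible text (strings) and tombstone counts (numbers)
>>>> doc = ['Hell', 1, 'world']   # reads "Hellworld"; one dead char
>>>>
>>>> visibleLength = (d) ->
>>>>   len = 0
>>>>   len += part.length for part in d when typeof part is 'string'
>>>>   len
>>>>
>>>> console.log visibleLength(doc)  # 9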
>>>>
>>>> Second, make an inverse transform function (prune). P(T(a, b), b) = a.
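>>>>
>>>> That identity is nice because it's trivially fuzz-testable. A sketch,
>>>> using the transform/prune above plus hypothetical randomDoc/randomOp
>>>> test helpers:
>>>>
>>>> assert = require 'assert'
>>>>
>>>> checkPruneInvertsTransform = (iterations = 1000) ->
>>>>   for i in [0...iterations]
>>>>     doc = randomDoc()
>>>>     a = randomOp doc   # two ops made against the same doc state
>>>>     b = randomOp doc
>>>>     assert.deepEqual prune(transform(a, b), b), a
>>>>   true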
>>>>
>>>> In its current incarnation, the algorithm works as follows:
>>>> Consider a DAG of all the operations. The closest metaphor is the
>>>> commit graph in git. Every operation has a unique hash identifier and
>>>> one or more parents (except the root, which has no parents). It'll
>>>> look something like this:
>>>> https://dl.dropboxusercontent.com/u/2494815/ops.png
>>>> ... except it'll usually be waaaay less complicated than that.
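>>>>
>>>> Concretely, a node in that DAG only needs a few fields. Field names
>>>> and the op format here are just illustrative:
>>>>
>>>> opNode =
>>>>   id: 'a9f3c2'         # hash of the exact signed, original bytes
>>>>   parents: ['77c1d0']  # one or more parent ids; empty for the root
>>>>   ancestors: 2         # creator's history length when it was made
>>>>   op: [3, {i: 'x'}]    # the edit itself, e.g. skip 3, insert 'x'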
>>>>
>>>> The cool thing about this structure is that if you apply all of the
>>>> operations in that graph in *any order*, you'll end up in the same
>>>> document state. (So long as each node comes after all its parents and
>>>> you use the transform function correctly).
>>>>
>>>> Each peer stores a history list of operations. Each operation is
>>>> stored twice. One copy is its original form (the exact bytes which
>>>> were used for hashing and signing). When sending the operation to a
>>>> remote peer, this is the copy we send. The second copy of the op is
>>>> transformed. At all times, the transformed ops in the history list are
>>>> written such that you could directly apply every operation in the
>>>> history list (in order) to a new document and you would reach the
>>>> current document state. The history lists on different nodes don't
>>>> have to be in the same order.
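>>>>
>>>> So a history entry holds two copies, and the transformed side obeys a
>>>> replay invariant we can check. A sketch, assuming the OT type exposes
>>>> create/apply:
>>>>
>>>> assert = require 'assert'
>>>>
>>>> # entry.original: verbatim signed op, for hashing/verifying/forwarding
>>>> # entry.transformed: applies cleanly at this position in *our* order
>>>> checkHistory = (type, history, snapshot) ->
>>>>   doc = type.create()
>>>>   for entry in history
>>>>     doc = type.apply doc, entry.transformed
>>>>   assert.deepEqual doc, snapshot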
>>>>
>>>> We also store a copy of the current document snapshot, because it's
>>>> really useful for the UI.
>>>>
>>>> Consider the DAG again. Every operation in the history list is a node
>>>> in that DAG, with its parents as other nodes. The DAG will have at
>>>> least one node with no children (= the bottom nodes in that diagram I
>>>> linked above). These childless nodes form the *frontier*. When a new
>>>> operation is created, all ops in the frontier become the new node's
>>>> parents. The new node also stores the number of ancestors it has (=
>>>> history list length). I'll get to why that's useful in a moment.
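>>>>
>>>> In code, finding the frontier and creating a new op is straightforward
>>>> (a sketch; hashOriginal is a hypothetical helper over the op's signed
>>>> bytes):
>>>>
>>>> frontier = (history) ->
>>>>   childless = {}
>>>>   childless[entry.id] = true for entry in history
>>>>   for entry in history
>>>>     delete childless[p] for p in entry.parents
>>>>   (id for id of childless)
>>>>
>>>> createOp = (history, op) ->
>>>>   id: hashOriginal(op)        # hypothetical hashing helper
>>>>   parents: frontier(history)
>>>>   ancestors: history.length   # saved for the ingest check below
>>>>   op: op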
>>>>
>>>> The history list has a special power because of the prune function:
>>>> operations which don't depend on one another can be swapped in place.
>>>>
>>>> swap = (a, b) ->          # history [a, b] becomes [b2, a2]
>>>>   b2 = prune(b, a)        # remove a's influence from b
>>>>   a2 = transform(a, b2)   # note: transform a by the *pruned* b
>>>>   [b2, a2]
>>>>
>>>>
>>>> When peers connect, they send each other missing ops. Figuring out
>>>> which ops are missing can be surprisingly tricky - but we'll figure
>>>> that out later. New ops must be ingested in order, so we always ingest
>>>> an operation after ingesting all of its parents.
>>>>
>>>> So you're given an op from another peer to ingest. We need to
>>>> transform that operation by all operations we have locally that aren't
>>>> in the op's ancestors. You can't just arbitrarily transform two ops by
>>>> one another - they both have to be applicable to the same document.
>>>> So what we do is rewrite our local history so that all of the
>>>> operation's ancestors are at the start of our history list, and all of
>>>> the other operations are at the end of the history. Once we've done
>>>> that, the operation can be transformed by all the ops in the second
>>>> half of our history list and applied (and appended to the end of the
>>>> history list).
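>>>>
>>>> As a sketch, ingest then looks something like this (reorderHistory is
>>>> the swap-based rewrite I sketch after the X A B Y example below):
>>>>
>>>> ingest = (state, newOp) ->
>>>>   # put newOp's ancestors first, everything else after them
>>>>   reorderHistory state.history, newOp
>>>>   op = newOp.op
>>>>   # transform by every op that isn't one of newOp's ancestors
>>>>   for entry in state.history[newOp.ancestors...]
>>>>     op = transform op, entry.transformed
>>>>   state.snapshot = apply state.snapshot, op
>>>>   state.history.push {original: newOp, transformed: op}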
>>>>
>>>> Remember, we know how many ancestors the new operation has. We saved
>>>> that earlier when the op was created. We also have all of that
>>>> operation's parents in our history list. So we find the position of
>>>> the op's last parent in the history list. If that position matches
>>>> the new op's ancestor count, we're golden.
>>>>
>>>> Imagine your history list contains:
>>>> A B X Y
>>>>
>>>> and the peer you're connecting to has:
>>>> A B C
>>>>
>>>> (C has B as a parent). C has 2 ancestors (A and B). When you ingest C,
>>>> you find B (its last parent), which is at position 2 in the history
>>>> list. So you know that everything after that in the history list isn't
>>>> one of C's ancestors, and you can transform C by X, Y and then apply it.
>>>>
>>>> If our history list instead looks like this:
>>>>
>>>> X A B Y
>>>> ... then you'll find B is too late in your history list. You need to
>>>> rewrite your history list to move the X after B. You use the swap
>>>> function to do that, swapping X and A, then X and B. Your history then
>>>> looks like A B X Y like we wanted, so you can proceed as above.
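>>>>
>>>> A sketch of that rewrite: bubble each of the new op's ancestors
>>>> leftward past any non-ancestors, swapping adjacent entries as above
>>>> (isAncestorOf is a hypothetical reachability check on the DAG):
>>>>
>>>> swapEntries = (history, i) ->   # swap entries i and i+1
>>>>   [a, b] = [history[i], history[i + 1]]
>>>>   b2 = prune(b.transformed, a.transformed)
>>>>   a2 = transform(a.transformed, b2)
>>>>   history[i] = {original: b.original, transformed: b2}
>>>>   history[i + 1] = {original: a.original, transformed: a2}
>>>>
>>>> reorderHistory = (history, newOp) ->
>>>>   placed = 0   # ancestors already moved to the front
>>>>   for i in [0...history.length]
>>>>     if isAncestorOf history[i], newOp
>>>>       # bubble this ancestor left to index `placed`
>>>>       for j in [i...placed] by -1
>>>>         swapEntries history, j - 1
>>>>       placed++
>>>>     break if placed is newOp.ancestors
>>>>   history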
>>>>
>>>> This algorithm requires a lot of transforms, but I think most of them
>>>> are unavoidable because of the signing / trust issue. It also requires
>>>> that every operation (ie, every key press) has a unique ID. That's
>>>> probably going to be a 16-byte hash or something, which seems like a
>>>> lot of extra bytes for typing one character.
>>>>
>>>> I implemented it here:
>>>> https://github.com/josephg/tp2stuff/blob/master/node2.coffee
>>>>
>>>> ... though so far I've only tested it with the randomizer setup I have
>>>> (a handful of peers, all randomly generating operations and
>>>> periodically (randomly) connecting to each other to sync up).
>>>>
>>>> 5 clients generating operations, with two random clients syncing about
>>>> every 5 operations, results in about 100x as many transforms as
>>>> operations (10k ops -> 11M transforms). Doing the same thing with 3
>>>> clients 'only' needs 2M transforms. Performance stays steady over time
>>>> (it doesn't get worse as the document grows, like it will with CRDTs) -
>>>> this scatterplot shows the number of seconds per 30 iterations on one
>>>> of my tests:
>>>> https://dl.dropboxusercontent.com/u/2494815/Screen%20Shot%202013-06-24%20at%201.01.42%20PM.png
>>>>
>>>> I think this huge number of transforms is one of the costs of doing
>>>> encryption & signing of the ops. We might be able to lower the
>>>> number of transforms required if we trust our servers with our keys -
>>>> but I'm not sure I want to do that. Real-world performance should be
>>>> much better than that anyway, but yeah - it makes me nervous. I'm
>>>> probably going to run more tests to see what kind of performance we
>>>> can expect in practice.
>>>>
>>>>
>>>> Blah. That was a lot of words. Thoughts?
>>>> -J
>>>>
>>>>
>>>> [1] http://www.cypherpunks.ca/otr/
>>>> [2] http://hal.archives-ouvertes.fr/docs/00/34/59/11/PDF/main.pdf
>>>> [3] https://github.com/share/ottypes/blob/master/lib/text-tp2.js
>>>> [4] based on my extremely fast non-tombstone equivalent data structure
>>>> here: https://github.com/share/libot/blob/master/text.h
>>>>
>>>
>
