incubator-wave-dev mailing list archives

From John Blossom <>
Subject Re: Delving into OT
Date Tue, 25 Jun 2013 15:10:23 GMT
I may have responded similarly in another thread, if so, my apologies for
the redundancy. Michael is right - a more manageable structure would help
(and yes, Wave would be ideal).

My main thought is that I can see that data format is going to play a
critical role in the success of Wave if it is to be a mobile-first product.
SMTP expands data only by about a 2:1 ratio, so I cannot imagine that Wave
would be acceptable on the basis of mobile performance and mobile data
costs if data formats expand the underlying content by a 5:1 ratio. This is
not going to be an easy thing to balance out if we're to support
history/playback readily, but I suppose that's the fun of the challenge. If
we crack this with P2P in tow, then we've really got something.

All the best,

John Blossom

On Mon, Jun 24, 2013 at 6:36 PM, Joseph Gentle <> wrote:

> On Mon, Jun 24, 2013 at 2:30 PM, Ali Lown <> wrote:
> >> 1. CRDTs (like Logoot[1])
> >> 2. 'Classical' OT using vector clocks for versioning.
> >> 3. The OT system [...] similar to classical OT, except using git style
> >> version hashes.
> >
> > A quick look at Logoot's paper says that deletion is not generally
> > possible with CRDTs (p5 section 4) without either a) Tombstones or b)
> > vector clocks.
> >
> > So, I suspect it would be possible to make use of CRDTs (which seem
> > the best approach) with the git-style hash to fully allow deletion.
> As I understand it (good chance I'm wrong), using tombstones with
> CRDTs requires leaving in canonical names per tombstone. One of the
> nice aspects of normal OT is that tombstones don't need names. If you
> have "hello ..... world", you can represent that as ["hello ", 5, "
> world"]. That means that at worst the document has a tombstone (an
> integer) between every two characters.
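[Editor's note: the "hello ..... world" representation above can be sketched concretely. This is a toy illustration; the function names are mine, not from any Wave or ShareJS code.]

```javascript
// A document with deleted regions stored as alternating string runs and
// integer tombstone counts -- no per-tombstone names needed.
const doc = ["hello ", 5, " world"];

// The visible text simply skips the integer tombstones:
function visibleText(doc) {
  return doc.filter(part => typeof part === "string").join("");
}

// The pre-deletion length (string runs plus tombstone counts), which is
// what transforming against old operations needs:
function fullLength(doc) {
  return doc.reduce(
    (n, part) => n + (typeof part === "string" ? part.length : part), 0);
}
```

At worst this costs one integer between every two characters, as noted above.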
> As I understand it, when they say "use vector clocks" they mean
> falling back to doing transformation for deletes. If we need
> transformation anyway, I don't know if there's much point using CRDTs.
> I had a chat with one of the original etherpad guys about this a few
> weeks ago. (David Greenspan, who also worked on wave for a while and
> now does Meteor). He mentioned occasionally trying to sync all peers
> to purge junk CRDT markers, and how that defeated the point. He
> generally wasn't too impressed with CRDTs.
> And then there's the problem of how to edit a tree like JSON. As I
> said, I don't know if anyone knows how to do that yet.
> >> CRDTs are probably the most elegant solution. They avoid the entire
> >> complicated mess of transformation. However:
> >> - They add a significant overhead to the document size (we must store
> >> a unique name between every two characters in the wave)
> >> - As far as I know, nobody has implemented arbitrary JSON editing
> >> using CRDTs. Implementing a set is very difficult.
> >> - Even without transform, we'll still have to store operations anyway
> >> so peers can verify signatures.
> >>
> >> I'd like to know what we should expect the actual document size
> >> overhead to be. I expect it to be at least a 5x size increase, but I
> >> don't know for sure. The papers I've read cleverly minimize the
> >> overhead by only doing OT on a line-by-line basis, but that's
> >> unacceptable for wave.
> >
> > Really? Wave currently has special handling for <line></line> elements
> > in the document structure. (IIRC for the sole purpose of making its OT
> > easier - there was some reason to not put the document as
> > <line>data</line> but <line></line>data).
> Turns out, you can't split an element with OT. So if you have
> <line>data</line>, you can't break the line into two lines. So they
> use special <line /> elements as newline characters (that works
> great).
> When I say a line-by-line basis, I mean, you can't edit inside lines.
> With logoot, you can only insert and delete lines. To make that work
> for wave we would have to constantly replace lines as we edit them,
> and you couldn't have two people both editing the same line at the
> same time.
> Might work great for source code management though.
> > Many systems (e.g. *nix) take the idea of a line quite seriously since
> > they all favour text-based approaches.
> > The protocol shouldn't really be handling arbitrary binary data (should
> > it?), and JSON can easily be arbitrarily newlined without a problem.
> No, you can't edit json using the same algorithm you use for text.
> Imagine if we have a list like this: [1]. I insert a second element
> (insert ",2") and you delete the 1. Now the JSON contains [,2], which
> is invalid.
> The right answer is to use a json-aware OT algorithm which understands
> lists and sets. Instead of inserting text into a json string, you use
> special list-insert, object-insert, etc operations. But as I say, I
> don't know if this is possible with CRDTs.
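[Editor's note: the `[1]` scenario above can be made concrete. This is a hedged sketch of what "list-insert"-style operations might look like; the op names and transform rules are illustrative inventions, not from any real JSON OT library.]

```javascript
// Ops target a list index directly, so concurrent edits can never
// produce syntactically invalid JSON like [,2].
function apply(list, op) {
  const out = list.slice();
  if (op.type === "list-insert") out.splice(op.index, 0, op.value);
  else if (op.type === "list-delete") out.splice(op.index, 1);
  return out;
}

// Transform op a against a concurrently applied op b by shifting indices.
function transform(a, b) {
  const shifted = { ...a };
  if (b.type === "list-insert" && b.index <= a.index) shifted.index += 1;
  if (b.type === "list-delete" && b.index < a.index) shifted.index -= 1;
  return shifted;
}

// The scenario from the text: the doc is [1]; I insert a 2 after it,
// you delete the 1. Both application orders converge on [2] -- valid JSON.
const ins = { type: "list-insert", index: 1, value: 2 };
const del = { type: "list-delete", index: 0 };
```

With text-based editing the same two edits yield the invalid `[,2]`; with index-based ops both peers end at `[2]`.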
> >> [...] a plaintext implementation which does this in javascript[3],
> although
> > it's quite slow (~150k transforms per second). I expect that an
> >> optimized version of this will be around 10-20 million transforms /
> >> second in C[4] for simple ops, and probably a few million per second
> >> in java.
> >
> > Nice numbers! (Heh, can we see some actual data :) )
> My C implementation is stupid fast compared to JS. This is plaintext,
> not TP2. This is how op apply speed changes as the document size
> changes:
> (Charts of apply speed vs. document size, on log and linear scales,
> are not preserved in the archive.)
> For transform, well:
> libot josephg$ ./test -b
> Benchmarking transform...
> run 0
> dl 10000 did 200000000 iterations in 1976.206000 ms: 101.204024 Miter/sec
> run 1
> dl 10000 did 200000000 iterations in 2012.051000 ms: 99.401059 Miter/sec
> ... that's a zero-allocation version with single character edits. But
> yeah, it's a fair bit better than the 150k transforms/second that I'm
> doing in javascript at the moment.
> >> When peers connect, they send each other missing ops. Figuring out
> >> which ops are missing can be surprisingly tricky - but we'll figure
> >> that out later. New ops must be ingested in order, so we always ingest
> >> an operation after ingesting all of its parents.
> >
> > Alarm bells start ringing here... Care to elaborate on how we could
> > reasonably do this, it sounds difficult.
> I hear that. This problem alone has been bothering me for a few weeks.
> First, if one peer only has a subset of the operations of another
> peer, it's easy. They both send their frontiers to each other, then the
> peer with more ops recursively collects every op back to the other
> peer's frontier and sends those.
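[Editor's note: the easy subset case described above can be sketched as a walk over the op DAG. This is a toy model assuming ops are keyed by hash and list their parents' hashes; it is not the actual wave protocol.]

```javascript
// All hashes reachable from a frontier (the frontier's ops plus every
// ancestor), following parent hashes back to the roots.
function ancestors(opsByHash, frontier) {
  const seen = new Set();
  const stack = [...frontier];
  while (stack.length) {
    const h = stack.pop();
    if (seen.has(h)) continue;
    seen.add(h);
    stack.push(...opsByHash[h].parents);
  }
  return seen;
}

// Subset case: the peer with more ops already holds everything the
// remote has, so it can compute the remote's full set locally and
// collect every op the remote is missing.
function missingOps(opsByHash, localFrontier, remoteFrontier) {
  const remoteHas = ancestors(opsByHash, remoteFrontier);
  const seen = new Set();
  const missing = [];
  const stack = [...localFrontier];
  while (stack.length) {
    const h = stack.pop();
    if (seen.has(h) || remoteHas.has(h)) continue;
    seen.add(h);
    missing.push(opsByHash[h]);
    stack.push(...opsByHash[h].parents);
  }
  return missing;
}
```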
> If they both have ops the other doesn't have, it's harder. Note that it
> doesn't matter if you receive an op you already have again - you can
> detect that from the hashes and discard it. And ops are
> small anyway. The big problem is round-trips - especially on mobile.
> I've brainstormed a few options:
> 1. They just batch ops & send groups of ~20k ops to each other until
> neither peer is missing anything
> 2. Peers could probably use bloom filters to figure out which ops to
> send. These are small (~1.2 *bits* per op) but they scale linearly
> with the number of ops in the document.
> 3. We could probably do something where peers send the ID of the last
> history element they have, the second last, the 4th last, 8th last,
> 16th last etc that they have. The other peer says what they're
> missing, and they send everything from that op onwards.
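[Editor's note: option 3 above can be sketched in a few lines. The exponential spacing keeps the handshake message O(log n) in the history length; function names here are illustrative, not from any implementation.]

```javascript
// Pick history entry IDs at exponentially growing distances from the
// tip: 1st-last, 2nd-last, 4th-last, 8th-last, ...
function checkpointIds(history) {
  const ids = [];
  for (let back = 1; back <= history.length; back *= 2) {
    ids.push(history[history.length - back].id);
  }
  return ids;
}

// The other peer replies with the newest checkpoint it recognizes; we
// then send everything after that point (some duplicates are harmless,
// since known hashes are discarded on receipt).
function opsAfter(history, knownId) {
  const i = history.findIndex(op => op.id === knownId);
  return history.slice(i + 1);
}
```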
> >> This algorithm requires a lot of transforms, but I think most of them
> >> are unavoidable because of the signing / trust issue.
> >
> > Is there some way to use GPG-style signing chains to avoid having to
> > retransform the whole thing whilst maintaining trust?
> The problem is that you need to be able to read the operations to
> transform them. Either the server can read your ops (and hence your
> document, and so can the NSA) or your server can't read your ops and
> can't transform.
> >> It also requires
> >> that every operation (i.e., every key press) has a unique ID. That's
> >> probably going to be a 16 byte hash or something; which seems like a
> >> lot of extra bytes for typing one character.
> >
> > This is a client problem. There is not always a reason to make a
> > single character an operation.
> Agree. Actually, the algorithm has an advantage here compared to the
> old OT system: if you compose ops together, you can still transform
> against the composed result. (In the old system, there were weird edge
> cases where that would break OT).
> So if a document is at version 1000 and someone sends a bunch of ops
> at version 2, you could compose all ops from 2-1000 together, then
> transform the new ops by that composed blob and apply them there. That
> -might- be faster than transforming all the ops against all 998 other
> ops individually.
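[Editor's note: the compose-then-transform idea above, sketched with a deliberately toy op format. An op here is just an ordered list of insertions; real Wave ops also carry deletes and retains, and the tie-breaking rule below is an arbitrary choice for illustration.]

```javascript
// Applying a then b is, for this toy sequential-insert format, just
// concatenation -- so composing ops 2..1000 into one blob is cheap.
function compose(a, b) {
  return a.concat(b);
}

// Transform a client op once against the whole composed blob, instead
// of against each of the ~998 server ops individually: shift each
// insertion right past any concurrent insertion at or before it.
function transform(op, against) {
  return op.map(({ pos, text }) => {
    let shifted = pos;
    for (const other of against) {
      if (other.pos <= shifted) shifted += other.text.length;
    }
    return { pos: shifted, text };
  });
}

function apply(doc, op) {
  for (const { pos, text } of op) {
    doc = doc.slice(0, pos) + text + doc.slice(pos);
  }
  return doc;
}
```

For example, with server ops `[{pos:0,text:"X"}]` then `[{pos:4,text:"Y"}]` on `"abc"`, a stale client insert of `"z"` at position 1 transforms once against the composed blob and lands correctly in `"XabcY"`.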
> -J
> >> Blah. That was a lot of words. Thoughts?
> >
> > See above. :)
> >
> > Ali
