couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Joseph Davis (JIRA)" <>
Subject [jira] Created: (COUCHDB-988) Rewrite couch_key_tree.erl
Date Tue, 14 Dec 2010 19:47:02 GMT
Rewrite couch_key_tree.erl

                 Key: COUCHDB-988
             Project: CouchDB
          Issue Type: Improvement
          Components: Database Core
            Reporter: Paul Joseph Davis
            Priority: Minor

The current key tree module is a fairly complicated piece of code with enough subtlety to
fill three romance novels. This ticket is a proposal to rewrite the current module with an
algorithm that will be more easy to reason about. Here I'll write a brief explanation of the
current algorithms, and then a short proposal for a simpler algorithm.

A key tree is used to represent the edit history of a document. Each node of the tree represents
a particular version of the document. Relations between nodes represent the order that these
edits were applied. For instance, a simple tree with a single linear path A->B->C means
that the edit C was based on the version B which was in turn based on A. In a world without
replication (and no ability to disable MVCC checks), all histories would be forced to be linear
lists of edits due to constraints imposed by MVCC (ie, new edits must be based on the current
version). However, we have replication, so we must deal with not so easy cases.

Consider a document in state A. This doc is replicated to a second node. We then edit the
document one each node leaving it in two different states, B and C. We now have two key trees,
A->B and A->C. When we go to replicate a second time, the key tree must combine these
two trees which gives us A->(B|C). For the astute reader, this is how conflicts are introduced.
In terms of the key tree, we say that we have two leaves (B and C) that are not deleted. Hence,
conflict. To remove a conflict, one of the edits (B or C) can be deleted, which results in,
A->(B|C->D) where D is an edit that is specially marked with a deleted=true flag.

Also, of note which will help us down the line, remember that there's another completely different
possible mode of operation. Given two couchdb instances, we could *create* two different docs
with the same id. Now we have two documents with edit histories E and F. Now after replication
we have edit history (E|D) but really, this isn't a tree. Its two roots to two different trees.
Our algorithms still work, we just have to consider multiple trees when we apply them. Remember
this after this next aside.

Hopefully from that brief description of simple operations its fairly intuitive how we can
end up with arbitrarily deep and branching trees due to distributed edits. Of note here is
that this operation parallels the nature of Git's commit model. Each commit sha1 depends on
the parent sha1 creating a sort of inverted merkle tree.

At this point I'll mention a quick point on the implementation of these trees. In Erlang,
they are implemented as a list of nested tuples with the roots being the most shallow node.
For instance, A->B->C could be represented as roughly, [{A, [{B, [{C}]}]}] which in
english: "a single element list of a two element tuple, where the first element is the key,
and the second element is a single element list of a two tuple....turtles."

The reason I make this note on implementation is that it should be apparent that the depth
of this data structure is going to be linearly dependent on the edit history of a document.
As we found out quickly, Erlang also apparently has some core C functions that are written
recursively. As everyone knows about recursion in C, there is a point where one more turtle
is one too many. Or, in other words, we run out of room on the stack to allocate more frames.

At this point we had a couple options. Patch Erlang (we did, it was rejected IIRC). Change
our representation (could work, but would break some of the algorithmic assumptions, especially
with how Erlang's SSA works). Or, add pruning to the revision history. This is what revision
stemming is.

Revision stemming in a nutshell means "Keep only the last N edits". But then its a tree, so
its actually, "keep the last N edits per leaf." The tradeoff here is that stemming may introduce
a conflict by accident. For instance, Given A->B on one node, replicate to a second, edit
it so that its A->B->C->D. Then applying stemming with an N of 2, we end up with
A->B on node 1, and C->D on node 2. After replication, this turns into (A->B|C->D)
ie, we weren't able to see that C was a descendant of B. This is generally considered an acceptable
tradeoff. You just need to make sure you replicate more often than the N rev_stemming edits.

The second important note about rev_stemming is that it changes our storage of trees. The
current implementation keeps track of how many edits have been discarded. Ie, in revisions
you see in documents of the format "Number-Hash", that number is basically how deep that revision
is in the path from root. We then leverage this information when merging stemmed trees so
that we know how deep we need to look to start matching edits.

One final minor piece of implementation. Each key in the tree also carries a value. This can
either be a document body about to be written, or perhaps the location in the file where that
revision can be read.

The bug that was uncovered in COUCHDB-968 was that the start-of-tree position information
was used in such a way that we ended up choosing the wrong value. Ie, we were choosing to
re-write the same document instead of discarding it for the version that was already on disk.
This ended up cascading through about 3 critical sections causing the end result of dupes
in _changes and _all_docs after compaction.

The fix that Adam has put together was to first fix the decision making process by making
a specific preference for which tree's value is used for identical revisions. Due to more
than a few subtleties this has take a bit to get right with all the intervening key tree code.
Hence, my desire to rewrite it.

The last thing I'll write up is an off the cuff approach that I thought of this morning. I
don't necessarily think this is the only way to rewrite the key tree, but it seems easier
(before coding). If anyone has ideas for other approaches that's just as well as the only
goal here is to reduce the complexity factor.

The current code was originally designed around the concept of rooted trees of edits. Ie,
Git. As we added features like stemming, the actual problem has shifted towards considering
a larger set of possibly unrelated trees. If we add one constraint, it is possible for us
to reconsider the algorithms in terms of more general graph approaches. That one extra constraint:

    1. keys (ie, revisions) must be unique within the set of trees (ie, single docid).

The general algorithm for merging N trees is something like:

    1. For each tree, create a list of parent/child edges.
    2. Find the union of all edges.
    3. Rebuild the tree.

There'd be more to work out of course. Depth would disappear, which would affect revision
patterns visible to the client. Though BigCouch has changed revisions as well. I'm sure there
are probably other things to work out.

So, that's all I've got right now. Its a bit of a twinkle in the eye, but I figured I'd get
my thoughts on the current code down on paper so I don't have to re-think them in the future.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message