couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jan Lehnardt (JIRA)" <j...@apache.org>
Subject [jira] Commented: (COUCHDB-988) Rewrite couch_key_tree.erl
Date Wed, 15 Dec 2010 00:18:01 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971493#action_12971493
] 

Jan Lehnardt commented on COUCHDB-988:
--------------------------------------

For posterity:

Damien's proposal to fix term_to_binary:

http://www.erlang.org/cgi-bin/ezmlm-cgi/3/387

The shot-down:

http://www.erlang.org/cgi-bin/ezmlm-cgi?3:msn:387

And the final OTP R13B release change log entry:

              OTP-7894  The term_to_binary/1 BIF used to be implemented with
	      recursive C code, which could cause the Erlang emulator to
	      terminate because of a stack overflow.

	      Also fixed some minor issues in term_to_binary/1 and
	      binary_to_term/1 pointed out by Matthew Dempsky.


> Rewrite couch_key_tree.erl
> --------------------------
>
>                 Key: COUCHDB-988
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-988
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Database Core
>            Reporter: Paul Joseph Davis
>            Priority: Minor
>
> The current key tree module is a fairly complicated piece of code with enough subtlety
to fill three romance novels. This ticket is a proposal to rewrite the current module with
an algorithm that will be more easy to reason about. Here I'll write a brief explanation of
the current algorithms, and then a short proposal for a simpler algorithm.
> A key tree is used to represent the edit history of a document. Each node of the tree
represents a particular version of the document. Relations between nodes represent the order
that these edits were applied. For instance, a simple tree with a single linear path A->B->C
means that the edit C was based on the version B which was in turn based on A. In a world
without replication (and no ability to disable MVCC checks), all histories would be forced
to be linear lists of edits due to constraints imposed by MVCC (ie, new edits must be based
on the current version). However, we have replication, so we must deal with not so easy cases.
> Consider a document in state A. This doc is replicated to a second node. We then edit
the document one each node leaving it in two different states, B and C. We now have two key
trees, A->B and A->C. When we go to replicate a second time, the key tree must combine
these two trees which gives us A->(B|C). For the astute reader, this is how conflicts are
introduced. In terms of the key tree, we say that we have two leaves (B and C) that are not
deleted. Hence, conflict. To remove a conflict, one of the edits (B or C) can be deleted,
which results in, A->(B|C->D) where D is an edit that is specially marked with a deleted=true
flag.
> Also, of note which will help us down the line, remember that there's another completely
different possible mode of operation. Given two couchdb instances, we could *create* two different
docs with the same id. Now we have two documents with edit histories E and F. Now after replication
we have edit history (E|D) but really, this isn't a tree. Its two roots to two different trees.
Our algorithms still work, we just have to consider multiple trees when we apply them. Remember
this after this next aside.
> Hopefully from that brief description of simple operations its fairly intuitive how we
can end up with arbitrarily deep and branching trees due to distributed edits. Of note here
is that this operation parallels the nature of Git's commit model. Each commit sha1 depends
on the parent sha1 creating a sort of inverted merkle tree.
> At this point I'll mention a quick point on the implementation of these trees. In Erlang,
they are implemented as a list of nested tuples with the roots being the most shallow node.
For instance, A->B->C could be represented as roughly, [{A, [{B, [{C}]}]}] which in
english: "a single element list of a two element tuple, where the first element is the key,
and the second element is a single element list of a two tuple....turtles."
> The reason I make this note on implementation is that it should be apparent that the
depth of this data structure is going to be linearly dependent on the edit history of a document.
As we found out quickly, Erlang also apparently has some core C functions that are written
recursively. As everyone knows about recursion in C, there is a point where one more turtle
is one too many. Or, in other words, we run out of room on the stack to allocate more frames.
> At this point we had a couple options. Patch Erlang (we did, it was rejected IIRC). Change
our representation (could work, but would break some of the algorithmic assumptions, especially
with how Erlang's SSA works). Or, add pruning to the revision history. This is what revision
stemming is.
> Revision stemming in a nutshell means "Keep only the last N edits". But then its a tree,
so its actually, "keep the last N edits per leaf." The tradeoff here is that stemming may
introduce a conflict by accident. For instance, Given A->B on one node, replicate to a
second, edit it so that its A->B->C->D. Then applying stemming with an N of 2, we
end up with A->B on node 1, and C->D on node 2. After replication, this turns into (A->B|C->D)
ie, we weren't able to see that C was a descendant of B. This is generally considered an acceptable
tradeoff. You just need to make sure you replicate more often than the N rev_stemming edits.
> The second important note about rev_stemming is that it changes our storage of trees.
The current implementation keeps track of how many edits have been discarded. Ie, in revisions
you see in documents of the format "Number-Hash", that number is basically how deep that revision
is in the path from root. We then leverage this information when merging stemmed trees so
that we know how deep we need to look to start matching edits.
> One final minor piece of implementation. Each key in the tree also carries a value. This
can either be a document body about to be written, or perhaps the location in the file where
that revision can be read.
> The bug that was uncovered in COUCHDB-968 was that the start-of-tree position information
was used in such a way that we ended up choosing the wrong value. Ie, we were choosing to
re-write the same document instead of discarding it for the version that was already on disk.
This ended up cascading through about 3 critical sections causing the end result of dupes
in _changes and _all_docs after compaction.
> The fix that Adam has put together was to first fix the decision making process by making
a specific preference for which tree's value is used for identical revisions. Due to more
than a few subtleties this has take a bit to get right with all the intervening key tree code.
Hence, my desire to rewrite it.
> The last thing I'll write up is an off the cuff approach that I thought of this morning.
I don't necessarily think this is the only way to rewrite the key tree, but it seems easier
(before coding). If anyone has ideas for other approaches that's just as well as the only
goal here is to reduce the complexity factor.
> The current code was originally designed around the concept of rooted trees of edits.
Ie, Git. As we added features like stemming, the actual problem has shifted towards considering
a larger set of possibly unrelated trees. If we add one constraint, it is possible for us
to reconsider the algorithms in terms of more general graph approaches. That one extra constraint:
>     1. keys (ie, revisions) must be unique within the set of trees (ie, single docid).
> The general algorithm for merging N trees is something like:
>     1. For each tree, create a list of parent/child edges.
>     2. Find the union of all edges.
>     3. Rebuild the tree.
> There'd be more to work out of course. Depth would disappear, which would affect revision
patterns visible to the client. Though BigCouch has changed revisions as well. I'm sure there
are probably other things to work out.
> So, that's all I've got right now. Its a bit of a twinkle in the eye, but I figured I'd
get my thoughts on the current code down on paper so I don't have to re-think them in the
future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message