couchdb-dev mailing list archives

From Robert Samuel Newson <>
Subject Re: [DISCUSS] Reduce on FDB take 3
Date Thu, 23 Jul 2020 09:17:52 GMT
Hi Adam,

You are right on those 3 bullet points; that is what ebtree is/does.

You are not right in your second paragraph, at least as it pertains to ebtree itself. ebtree
stores the inserted keys, values, links between btree nodes, and intermediate reduction values
in the fdb value (the fdb key is always constructed as {IndexPrefix, NodeId}, where NodeId
is currently an auto-incrementing integer). Some of those items are unused (stored as a single-byte
NIL) if no reduce_fun is specified. In Garren's work that integrates ebtree into the
couch_views application you are right, in that ebtree is only used for the reduce storage.
I've commented on this topic on that PR as I think that's a missed opportunity.
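To make that layout concrete, here is a sketch of how such a node maps onto a single fdb key/value pair. The field names, the encoding, and the NIL placeholder are my illustration of the description above, not ebtree's actual wire format:

```javascript
// Illustrative sketch only: the field names and NIL placeholder are my
// assumptions, not ebtree's actual encoding.
const NIL = null; // stands in for the single-byte NIL stored when no reduce_fun is given

function encodeNode(indexPrefix, nodeId, node, reduceFun) {
  // fdb key: always {IndexPrefix, NodeId}, with NodeId an auto-incrementing integer
  const key = [indexPrefix, nodeId];
  // fdb value: everything else lives here -- members, links, and the reduction
  const value = {
    members: node.members,              // the inserted keys and values
    prev: node.prev,                    // links between btree nodes
    next: node.next,
    reduction: reduceFun
      ? reduceFun(node.members.map(m => m.key),
                  node.members.map(m => m.value),
                  false)                // rereduce = false at the leaf level
      : NIL                             // unused slot when no reduce_fun is specified
  };
  return { key, value };
}
```

The point being that the user-defined values share the same fdb value as the tree structure itself, which is what the discussion below is about.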

Throughout ebtree's development I had considered whether the tree should be stored separately
from the user-defined (and therefore potentially large and non-uniform in size) values or
not. As you've noted, the "order" in ebtree is implemented correctly (on cardinality, not size)
and this is key to predictable performance (as is the rebalancing when deleting, which couch_btree
has never attempted).

Without the user-defined values, we could choose a high value for order (say, 100 or 200,
perhaps even higher) and have confidence that nodes will fit within fdb value size limits.
I deliberately went simple here and deferred the debate until I had something worth discussing.
And so I would love to continue that discussion here; this thread seems the appropriate place.
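For a rough sense of the headroom: FoundationDB caps a single value at 100,000 bytes, so with user-defined values excluded the arithmetic is simple. The per-entry and overhead figures below are assumed round numbers, not measured ebtree sizes:

```javascript
// Back-of-envelope check; per-entry and overhead sizes are assumptions.
// FoundationDB limits a single value to 100,000 bytes.
const FDB_VALUE_LIMIT = 100000;

function maxOrder(bytesPerEntry, overheadBytes) {
  // how many fixed-size entries fit in one fdb value
  return Math.floor((FDB_VALUE_LIMIT - overheadBytes) / bytesPerEntry);
}
```

With, say, 64 bytes per entry (key plus child link) and 1 KB of node overhead, `maxOrder(64, 1024)` gives 1546, so an order of 100 or 200 leaves ample margin even for somewhat larger keys.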

My thoughts so far are:

1) We could introduce new limits on what we allow users to reduce on, with a view to capturing
95%+ of the intended use cases, e.g., restricting the emitted values to scalars, which we know
will reduce without growing excessively (and, yes, I would include lists or objects of scalars).
The only thing I think we'd want to preclude is what reduce_limit also, and naively, tried
to prevent: an ever-growing reduction value.

2) As you've suggested independently, the emitted key/values and the intermediate reductions
are stored in their own k-v entries, leaving the k-vs of the btree structure itself within
predictable size bounds.

3) Splitting an encoded node over multiple adjacent k-vs, exactly like we do with a large
document body today.

4) A hybrid approach of the current ebtree code and 2 above, externalising keys/values/reductions
if warranted on a per-node basis. Well-behaved reduce functions over appropriate emitted data
would not need to do so, but we'd not be surprised if large values or large reductions
happened on occasion in production deployments.

I do think it's a good time to define what an appropriate use of reduce in CouchDB is, whatever
the mechanism for calculating and storing it. I don't think we should support egregious cases
like "function(ks, vs, rr) { return vs; }", for example.
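To spell out the distinction (these are sketches in CouchDB's reduce signature, not proposed built-ins):

```javascript
// Both follow CouchDB's reduce signature: function(keys, values, rereduce).
// A well-behaved reduce: the result is a scalar, so intermediate reductions
// stay small no matter how many rows feed into them.
function sum(keys, values, rereduce) {
  return values.reduce((acc, v) => acc + v, 0);
}

// The egregious case: the "reduction" is as large as the input, so it can
// only grow as rows accumulate -- exactly what reduce_limit tried to catch.
function identity(keys, values, rereduce) {
  return values;
}
```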

Finally, I note that I have a local branch called "immutable" which changes node ids whenever
the "members" attribute of an ebtree node changes. It changes node ids to uuids and adds a
cache callback. The intention is to eliminate the vast majority of inner node lookups _outside_
of any fdb transaction (it is always necessary to read the root node, and we can't cache leaf
nodes, as they are linked for efficient forward and reverse ordered traversal). This code works
as expected, but I have been unable to prove its benefit, as ebtree performs very well under
the test scenarios I've been able to bring to bear so far. I will post that work as a branch
to couchdb later today.
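The safety argument behind that branch is worth making explicit: since a node gets a fresh uuid whenever its members change, a cached copy found under an id can never be stale; stale entries are simply never asked for again. A minimal sketch of such a cache callback (the 'get'/'set' shape is my assumption, not the branch's actual API):

```javascript
// Minimal sketch of the caching idea; the callback shape is assumed.
const cache = new Map();

function cacheCallback(op, nodeId, node) {
  // A node receives a fresh uuid whenever its members change, so an entry
  // cached under an old id is never requested again -- a hit by id can
  // therefore never return stale data.
  if (op === 'set') {
    cache.set(nodeId, node);
  } else if (op === 'get') {
    return cache.get(nodeId); // undefined -> caller falls back to an fdb read
  }
}
```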


> On 23 Jul 2020, at 01:54, Adam Kocoloski <> wrote:
>
> This is pretty interesting. I find myself doing a mental double-take when I realize that it is
> - storing an ordered set of KVs,
> - in an ordered KV storage engine,
> - without relying on the ordering of the storage engine!
>
> If I’m understanding things correctly (and that’s a big if), adding a reduce function to a view completely changes how the “map” portion of that view is stored. A view with no reduce function stores the rows of the view as individual KVs in FDB, but the same map output when paired with a reduce function would be chunked up and stored in the leaf nodes of this b+tree. Has anyone compared the performance of those two different data models, e.g. by querying a map-only view and an mr-view with reduce=false?
>
> If it turns out that there’s a significant difference, is it worth considering a model where the leaf nodes of the b+tree just point to KV ranges in FDB, rather than holding the actual user-emitted KV data themselves? Then the inner nodes of the b+tree would just provide the data structure to incrementally maintain the aggregations. Inserts to that range of KVs during indexing would still go through the b+tree code path, but queries for the map portion of the view could go directly to the relevant range of KVs in FDB, skipping the traversal through the inner nodes and limiting the data transfer only to the rows requested by the client.
>
> Conversely, if the b+tree approach is actually better even without a user-supplied reduce function, shouldn’t we use it for all views?
>
> As an aside, I’m very glad to see a departure from couch_btree’s approach of dynamically modifying the “order” of the “b+tree” based on the size of the reduction. Such an ugly edge case there ...
>
> Cheers, Adam
>> On Jul 21, 2020, at 8:48 AM, Robert Newson <> wrote:
>> Thank you for those kind words. 
>> -- 
>> Robert Samuel Newson
>> On Tue, 21 Jul 2020, at 13:45, Jan Lehnardt wrote:
>>> Heya Garren and Bob,
>>> this looks really nice. I remember when this was a twinkle in our
>>> planning eyes. Seeing the full thing realised is very very cool.
>>> I’m additionally impressed by the rather pretty and clean code.
>>> This doesn’t have to be hard :)
>>> Looking forward to see this in action.
>>> Best
>>> Jan
>>> —
>>>> On 21. Jul 2020, at 14:01, Garren Smith <> wrote:
>>>> Hi All
>>>> We have a new reduce design for FoundationDB and we think this one will
>>>> work.
>>>> Recently I proposed a simpler reduce design [1] and at the same time, Bob
>>>> (rnewson) looked at implementing a B+tree [2], called ebtree, on top of
>>>> FoundationDB. The b+tree implementation has turned out really nicely, the
>>>> code is quite readable and works really well. I would like to propose that
>>>> instead of using the simpler reduce design I mentioned in the previous
>>>> email, we rather go with a reduce implementation on top of ebtree. The big
>>>> advantage of ebtree is that it allows us to keep the behaviour of CouchDB
>>>> 3.x.
>>>> We have run some basic performance tests on the Cloudant performance
>>>> clusters, and so far the performance is looking quite good and very
>>>> similar to my simpler reduce work.
>>>> There is an unknown around the ebtree Order value. The Order is the number
>>>> of key/values stored per node. We need to determine the optimal order
>>>> value for ebtree so that it doesn't exceed FoundationDB's key/value limits
>>>> and still performs well. This is something we will be looking at as we
>>>> finish up the reduce work. The work in progress for the reduce PR is
>>>> A great thanks to Bob for implementing the B+tree. I would love to hear
>>>> your thoughts or questions on this.
>>>> Cheers
>>>> Garren
>>>> [1]
>>>> [2]
>>>> [3]
