couchdb-dev mailing list archives

From Garren Smith <gar...@apache.org>
Subject [DISCUSS] Built-in reduce indexes
Date Tue, 23 Apr 2019 15:38:35 GMT
Hi All,

Following on from the map discussion, I want to start the discussion on
built-in reduce indexes.

## Built-in reduces
Built-in reduces are definitely the easier of the two reduce options to
reason about and design. The one factor to keep in mind for reduces is that
we need to be able to reduce at different group levels. So a data model for
that would look like this:

{?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?REDUCE,
<group_level>, <group_key>, <_reduce_function_name>} -> <aggregate_value>

Most of that is the same as the map data model. The change starts at the
?REDUCE subspace: we add the group_level (from 1 to the number of keys
emitted in the map function), then the group key used in the reduce, then
the reduce function name, e.g. _sum or _count. The aggregated value is
stored as the FDB value.
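
To make the layout concrete, here is a minimal sketch of the keys that one
emitted map key would produce, one per group level. The constants standing
in for ?DATABASE, ?VIEWS and ?REDUCE and the `reduce_keys` helper are
hypothetical, and plain Python tuples stand in for packed FDB keys.

```python
# Hypothetical placeholders for the ?DATABASE, ?VIEWS and ?REDUCE prefixes.
DATABASE, VIEWS, REDUCE = "db", "views", "reduce"

def reduce_keys(view_sig, view_id, emitted_key, reducer):
    """Return one tuple key per group level for a single emitted map key."""
    if not isinstance(emitted_key, (list, tuple)):
        emitted_key = [emitted_key]  # a scalar key only has group_level=1
    keys = []
    for level in range(1, len(emitted_key) + 1):
        group_key = tuple(emitted_key[:level])  # first `level` key elements
        keys.append((DATABASE, VIEWS, view_sig, VIEWS, view_id, REDUCE,
                     level, group_key, reducer))
    return keys

keys = reduce_keys("sig", "by_date", [2019, 4, 23], "_sum")
for key in keys:
    print(key)
```

An emitted key of `[2019, 4, 23]` yields three keys, for the group keys
`(2019,)`, `(2019, 4)` and `(2019, 4, 23)`.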

### Index management

To update the reduce indexes, we will rely on the `id_index` and the
`update_seq` defined in the map discussion. Then, to apply changes, we
calculate the change in the aggregate value for the keys at the highest
group level, and apply that change to all the lower group levels using
FDB’s atomic operations [1].
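
As a rough sketch of that update path for `_sum`, assuming the previously
emitted value for a document can be looked up (for example via the
`id_index`): the dict and the `atomic_add` helper below are in-memory
stand-ins for the FDB keyspace and an atomic add, not real FDB calls, and
the keys are shortened to `(group_level, group_key)`.

```python
from collections import defaultdict

# In-memory stand-in for the keyspace: (group_level, group_key) -> aggregate.
index = defaultdict(int)

def atomic_add(key, delta):
    """Stand-in for an FDB atomic add mutation on `key`."""
    index[key] += delta

def apply_sum_change(emitted_key, old_value, new_value):
    """Apply the change for one emitted key to every group level."""
    delta = new_value - old_value  # change at the highest group level
    for level in range(1, len(emitted_key) + 1):
        atomic_add((level, tuple(emitted_key[:level])), delta)

apply_sum_change([2019, 4, 23], 0, 10)  # new row emitting 10
apply_sum_change([2019, 4, 24], 0, 5)   # new row emitting 5
apply_sum_change([2019, 4, 23], 10, 7)  # first row updated from 10 to 7
```

Because each level only receives a delta, no existing aggregate has to be
read before it is updated.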

### Reducer functions

FDB’s atomic operations support all the built-in reduce functions that
CouchDB supports, so we can use them as part of our reduce function. For
the `_stats` reduce function, we will have to split the result across
multiple key-value pairs, so its data model has an extra key element that
records which stat the value holds:

{?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?REDUCE,
<group_level>, <group_key>, <_reduce_function_name>, <_stat_field>}
-> <aggregate_value>
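
A minimal sketch of the split, assuming the five fields that `_stats`
reports (sum, count, min, max, sumsqr). The `stats_key`, `add_value` and
`read_stats` helpers are hypothetical, the key is shortened to its last four
elements, and a dict stands in for FDB. Only insertion is shown; removing a
value that is the current min or max is not covered here.

```python
# The five fields reported by CouchDB's _stats reducer.
STAT_FIELDS = ("sum", "count", "min", "max", "sumsqr")

def stats_key(group_level, group_key, stat_field):
    # Hypothetical shortened key; the real key would carry the full prefix.
    return (group_level, tuple(group_key), "_stats", stat_field)

def add_value(store, group_level, group_key, value):
    """Fold one emitted numeric value into the five per-field entries."""
    def key(field):
        return stats_key(group_level, group_key, field)
    store[key("sum")] = store.get(key("sum"), 0) + value
    store[key("count")] = store.get(key("count"), 0) + 1
    store[key("sumsqr")] = store.get(key("sumsqr"), 0) + value * value
    store[key("min")] = min(store.get(key("min"), value), value)
    store[key("max")] = max(store.get(key("max"), value), value)

def read_stats(store, group_level, group_key):
    """Reassemble the _stats object from its per-field entries."""
    return {f: store[stats_key(group_level, group_key, f)] for f in STAT_FIELDS}

store = {}
add_value(store, 1, [2019], 3)
add_value(store, 1, [2019], 5)
stats = read_stats(store, 1, [2019])
print(stats)
```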

We do have a problem with `_approx_count_distinct`, because its filter does
not support removing keys. So we have three options:

1. Ignore key removal entirely in the filter, since this is just an
estimate
2. Implement a real COUNT DISTINCT function, which we can do because we’re
not trying to merge results from different local shards
3. Don’t support it going forward
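
For option 2, one possible shape (an assumption, not a settled design) is to
keep a reference count per distinct value, so that removals can be undone
exactly. The `ExactCountDistinct` class below is a hypothetical in-memory
sketch of that idea.

```python
class ExactCountDistinct:
    """Exact distinct count that supports removal via per-value ref counts."""

    def __init__(self):
        self.refs = {}  # value -> number of rows currently emitting it

    def add(self, value):
        self.refs[value] = self.refs.get(value, 0) + 1

    def remove(self, value):
        count = self.refs.get(value, 0)
        if count <= 1:
            self.refs.pop(value, None)  # last reference gone (or never seen)
        else:
            self.refs[value] = count - 1

    def count(self):
        return len(self.refs)

cd = ExactCountDistinct()
for value in ["a", "a", "b"]:
    cd.add(value)
cd.remove("a")           # "a" is still emitted by one other row
after_first = cd.count()
cd.remove("a")           # now "a" is gone entirely
after_second = cd.count()
```

The cost is one stored entry per distinct value, unlike the fixed-size
approximate filter.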

### Group_level=0

One trickier situation is a group_level=0 query with a key range, which
would require us to do some client-level aggregation. We would have to get
the aggregate values at `group_level=1` for the supplied key range and then
aggregate those values together.
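
A minimal sketch of that client-level aggregation, assuming the range
boundaries line up with `group_level=1` keys and the end key is inclusive.
The `reduce_group_level_0` helper and the sample rows are hypothetical; the
list of rows stands in for a range read over the `group_level=1` keys.

```python
def reduce_group_level_0(level1_rows, start_key, end_key, combine=sum):
    """Combine group_level=1 aggregates whose keys fall in [start_key, end_key].

    `combine` is the re-reduce step: `sum` fits _sum and _count,
    while `min` or `max` would fit the corresponding _stats fields.
    """
    in_range = [value for key, value in level1_rows
                if start_key <= key <= end_key]
    return combine(in_range)

# Stand-in for the stored group_level=1 rows: (group_key, aggregate).
level1_rows = [((2017,), 4), ((2018,), 10), ((2019,), 12), ((2020,), 1)]

total = reduce_group_level_0(level1_rows, (2018,), (2019,))
print(total)
```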

I would love to hear your thoughts and ideas on this.

If you are wondering about custom reduce indexes, I’m still working on that
and will start a discussion email on that a little later.

Cheers
Garren
