couchdb-user mailing list archives

From Chris Anderson <jch...@apache.org>
Subject Re: Incremental map/reduce
Date Fri, 30 Jan 2009 18:32:15 GMT
On Fri, Jan 30, 2009 at 2:18 AM, Brian Candler <B.Candler@pobox.com> wrote:
> Now, the other thing which I don't understand is group_level.

I think you understand most of this well enough that the
other links will get you the rest of the way there. I'll respond on
group_level because it is typically the most confusing (or at least it
was to me) until you realize how overwhelmingly simple it is.

Once you understand how normal reduce queries (with group=false) work,
i.e. those that return a single reduction value for whatever key range
you specify, group_level queries are no more complex. A group_level
query is essentially a macro that automatically runs one normal
(group=false) reduce query for each interval in a set of intervals
defined by the level.

So with group_level=1, and keys like

["a",1,1]
["a",3,4]
["a",3,8]
["b",2,6]
["b",2,6]
["c",1,5]
["c",4,2]

CouchDB will internally run 3 reduce queries for you. One that reduces
all rows where the first element of the key = "a", one for "b", and
one for "c".

If you were to query with group_level=2, you'd get a reduce query run
for each unique key prefix (the first two elements of the key),
i.e. ["a",1], ["a",3], ["b",2], ["c",1], ["c",4].
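
To make the macro idea concrete, here is a rough Python model of it. It is
only a sketch of the behaviour described above, not how CouchDB is
implemented: the row values (all 1), the sum reduce, and the helper names
reduce_range and group_level_query are assumptions made up for illustration.

```python
from itertools import groupby

# Map rows as (key, value) pairs, already sorted by key as in a view index.
# The keys are the ones from the example; the values (all 1) are invented.
ROWS = [
    (["a", 1, 1], 1),
    (["a", 3, 4], 1),
    (["a", 3, 8], 1),
    (["b", 2, 6], 1),
    (["b", 2, 6], 1),
    (["c", 1, 5], 1),
    (["c", 4, 2], 1),
]


def reduce_range(rows):
    """Model of a normal group=false reduce: one value for the given rows."""
    return sum(value for _key, value in rows)


def group_level_query(rows, level):
    """Run one group=false style reduce per distinct key prefix of length `level`."""
    results = []
    for prefix, group in groupby(rows, key=lambda row: row[0][:level]):
        results.append((prefix, reduce_range(list(group))))
    return results


level_1 = group_level_query(ROWS, 1)  # one reduction each for "a", "b", "c"
level_2 = group_level_query(ROWS, 2)  # one reduction per two-element prefix
print(level_1)
print(level_2)
```

With these rows, the level 1 call performs three reductions and the level 2
call performs five, matching the counts above.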

group=true is the conceptual equivalent of group_level=exact, so
CouchDB runs a reduce per unique key in the map row set.

I find that thinking of group_level and group=true as macros, which
are just running a series of group=false queries internally, clarifies
understanding and expectations about these features. For instance, in
a group=true query, since we see that Couch will run a reduce query
per unique key (in the overall key range, as specified by start and
end keys), we can expect the cost to be O(n) where n is the number of
unique keys in the range. I used to be surprised when group=true
queries were "slow" but now that I understand the mechanism it's hard
to see how they couldn't be.
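
As a rough illustration of that cost argument, the sketch below models a
group=true query over a key range and counts how many reductions it runs.
Everything here is an assumption for the sake of the example: the values
(all 1), the sum reduce, the function name group_true_query, and the use of
plain Python list comparison as a stand-in for CouchDB's key collation.

```python
from itertools import groupby

# Same example keys as above, sorted; the values (all 1) are invented.
ROWS = [
    (["a", 1, 1], 1),
    (["a", 3, 4], 1),
    (["a", 3, 8], 1),
    (["b", 2, 6], 1),
    (["b", 2, 6], 1),
    (["c", 1, 5], 1),
    (["c", 4, 2], 1),
]


def group_true_query(rows, startkey, endkey):
    """Model of group=true: one reduce per unique key in [startkey, endkey].

    Returns the reduced rows and the number of reductions performed.
    Python list ordering only approximates CouchDB collation.
    """
    in_range = [row for row in rows if startkey <= row[0] <= endkey]
    results = []
    reductions = 0
    for key, group in groupby(in_range, key=lambda row: row[0]):
        results.append((key, sum(value for _key, value in group)))
        reductions += 1
    return results, reductions


# A sub-range of the keys, then the whole key range.
results, reductions = group_true_query(ROWS, ["a", 3, 4], ["c", 1, 5])
all_results, all_reductions = group_true_query(ROWS, ROWS[0][0], ROWS[-1][0])
print(reductions, all_reductions)
```

The reduction count tracks the number of unique keys in the range, not the
number of rows: the duplicate ["b",2,6] rows share a single reduction.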

In the future we may cache the final reduction values for group_level
queries in another index, which could speed up these queries when they
are run a second time, as well as potentially allowing for
sort-by-value queries to be done more efficiently. Then you'll be able
to ask what the most popular tags in a corpus are. Currently queries
like that need to be done by running group=true, then sorting on the
client.
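
For what it's worth, the client-side step can be as small as the sketch
below. The response dict mimics the rows/key/value shape of a view result,
but the tags, the counts, and the helper name most_popular are invented for
this example.

```python
# Hypothetical result of a group=true query on a tag-count view.
# The tags and counts are made up for illustration.
response = {
    "rows": [
        {"key": "couchdb", "value": 12},
        {"key": "erlang", "value": 7},
        {"key": "json", "value": 12},
        {"key": "mapreduce", "value": 3},
    ]
}


def most_popular(rows, limit):
    """Sort reduced rows by value, largest first, and keep the top `limit`."""
    return sorted(rows, key=lambda row: row["value"], reverse=True)[:limit]


top = most_popular(response["rows"], 2)
for row in top:
    print(row["key"], row["value"])
```

The client has to fetch every reduced row before it can sort, which is why
an index over the final reduction values would help here.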

Hope that helps!

-- 
Chris Anderson
http://jchris.mfdz.com
