couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rob Crowell <>
Subject Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?
Date Tue, 15 Nov 2011 20:03:30 GMT
Hey Zach,

That looks really good!  Thanks for the pointer, I'd read about list
functions but never thought about using them in this way.

>From what I read, it seems there is no way to build this list function


On Tue, Nov 15, 2011 at 1:37 PM, Zachary Zolton
<> wrote:
> Rob
> Since you've already got a view that'll give you Boris's 50k unique
> songs, you could use a _list function to return the number of rows.
> Something like this should do the trick:
> function() {
>  var count = 0;
>  while(getRow()) count++;
>  return JSON.stringify({count: count});
> }
> If you query this list function, with the same view, key range and
> group level, it'll just respond with a bit of JSON, such as:
> {"count":"50612"}
> Is that more like what you're looking for?
> You can read up more here:
> --Zach
> On Tue, Nov 15, 2011 at 11:39 AM, Rob Crowell <> wrote:
>> I don't think this works, unless I am misunderstanding.
>> If our user "boris" listened to the same song 20 times, and only
>> listened to that one song, the _count reduce would return 20 would it
>> not?  I would like the value 1 instead (only 1 distinct song listened
>> to).
>> --Rob
>> On Tue, Nov 15, 2011 at 12:21 PM, Robert Newson <> wrote:
>>> then just emit(doc.user_id, null) and use _count?
>>> B.
>>> On 15 November 2011 17:17, Rob Crowell <> wrote:
>>>> Hello everyone,
>>>> I'm writing some log-parsing code which is currently running on a
>>>> MySQL backend.  We're doing a huge amount of aggregates on this data
>>>> right now, but performance is suffering and I'm looking for
>>>> alternatives.  The idea of incremental map/reduce initially seemed
>>>> like the exact right thing, but I can't seem to express some of the
>>>> most important queries we are currently running in our production
>>>> system.
>>>> We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
>>>> WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
>>>> "2010-02-01" variety.  Currently in MySQL-land we've got a cron job to
>>>> pre-compute these aggregates (it checks modified timestamps and pulls
>>>> in only new records) and write them to a summary table.  I initially
>>>> believed I could use CouchDB's incremental map/reduce to effortlessly
>>>> build and update our "summary information" as it changes, but I'm
>>>> stuck.  I'm trying to relax, but I can't figure out exactly how :)
>>>> In our example, our user "boris" listens to the same song many times
>>>> each month, and we're interested in the number of distinct songs he's
>>>> listened to during a specified time period (NOT the number of song
>>>> plays, but the number of distinct songs played).  In CouchDB it isn't
>>>> much trouble to get all of the unique songs that he's listened to
>>>> during a period.  Here's our document:
>>>> {
>>>>  song_id: "happy birthday",
>>>>  user_id: "boris",
>>>>  date_played: [2011, 11, 14, 00, 12, 55],
>>>>  _id: ...
>>>> }
>>>> To get the unique values, all we need to do is emit([doc.user_id,
>>>> doc.date_played, doc.song_id], null), reduce with _count, and query
>>>> with a startkey=["boris", "2011-01-01"]&endkey=["boris",
>>>> "2011-02-01"]&group_level=1.  This query will yield results like:
>>>> ["boris", "happy birthday"], 20
>>>> ["boris", "yesterday"], 14
>>>> ...
>>>> However, if our user has listened to 50,000 songs during the date
>>>> range, we'll get back 50,000 rows which seems expensive.  What I want
>>>> is just the scalar 50,000.  I've tried writing a reduce that returns
>>>> the set of distinct song_ids for each user (turning the values list
>>>> into a dictionary and back again), but CouchDB complains that I am not
>>>> reducing my values fast enough :-/  I'm also not sure how to reduce
>>>> this list to a scalar at the end without returning the whole thing to
>>>> my client (which defeats the purpose of all this anyways).
>>>> Is this possible to do in CouchDB today?  If not, is it something that
>>>> is on the roadmap, or does the internal structure of CouchDB's b-tree
>>>> make this really hard to do?  It would of course be possible for me to
>>>> implement this myself (subscribe to the update notifications and
>>>> update my counts as appropriate in a custom script), but I wanted to
>>>> move to CouchDB so that I wouldn't have to do all this myself.
>>>> Thanks for any advice!
>>>> --Rob

View raw message