couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rob Crowell <robccrow...@gmail.com>
Subject Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?
Date Tue, 15 Nov 2011 17:17:19 GMT
Hello everyone,

I'm writing some log-parsing code which is currently running on a
MySQL backend.  We're doing a huge amount of aggregates on this data
right now, but performance is suffering and I'm looking for
alternatives.  The idea of incremental map/reduce initially seemed
like the exact right thing, but I can't seem to express some of the
most important queries we are currently running in our production
system.

We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
"2010-02-01" variety.  Currently in MySQL-land we've got a cron job to
pre-compute these aggregates (it checks modified timestamps and pulls
in only new records) and write them to a summary table.  I initially
believed I could use CouchDB's incremental map/reduce to effortlessly
build and update our "summary information" as it changes, but I'm
stuck.  I'm trying to relax, but I can't figure out exactly how :)

In our example, our user "boris" listens to the same song many times
each month, and we're interested in the number of distinct songs he's
listened to during a specified time period (NOT the number of song
plays, but the number of distinct songs played).  In CouchDB it isn't
much trouble to get all of the unique songs that he's listened to
during a period.  Here's our document:

{
  song_id: "happy birthday",
  user_id: "boris",
  date_played: [2011, 11, 14, 00, 12, 55],
  _id: ...
}

To get the unique values, all we need to do is emit([doc.user_id,
doc.date_played, doc.song_id], null), reduce with _count, and query
with a startkey=["boris", "2011-01-01"]&endkey=["boris",
"2011-02-01"]&group_level=1.  This query will yield results like:

["boris", "happy birthday"], 20
["boris", "yesterday"], 14
...

However, if our user has listened to 50,000 songs during the date
range, we'll get back 50,000 rows which seems expensive.  What I want
is just the scalar 50,000.  I've tried writing a reduce that returns
the set of distinct song_ids for each user (turning the values list
into a dictionary and back again), but CouchDB complains that I am not
reducing my values fast enough :-/  I'm also not sure how to reduce
this list to a scalar at the end without returning the whole thing to
my client (which defeats the purpose of all this anyways).

Is this possible to do in CouchDB today?  If not, is it something that
is on the roadmap, or does the internal structure of CouchDB's b-tree
make this really hard to do?  It would of course be possible for me to
implement this myself (subscribe to the update notifications and
update my counts as appropriate in a custom script), but I wanted to
move to CouchDB so that I wouldn't have to do all this myself.

Thanks for any advice!

--Rob

Mime
View raw message