incubator-couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zachary Zolton <zachary.zol...@gmail.com>
Subject Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?
Date Wed, 16 Nov 2011 21:40:18 GMT
Except that the OP specified the need to query for these counts within
a date range. So, you have to collate by listening time, not the song.

On Wed, Nov 16, 2011 at 2:45 PM, Marcello Nuccio
<marcello.nuccio@gmail.com> wrote:
> Hi Rob,
> I remember I've done something similar a while ago, but I cannot find
> the code right now and I don't have time to rewrite it right now...
> however the trick is to only count when the song name changes. This
> works because view rows are sorted.
>
> HTH,
>  Marcello
>
> 2011/11/15 Rob Crowell <robccrowell@gmail.com>:
>> Hello everyone,
>>
>> I'm writing some log-parsing code which is currently running on a
>> MySQL backend.  We're doing a huge amount of aggregates on this data
>> right now, but performance is suffering and I'm looking for
>> alternatives.  The idea of incremental map/reduce initially seemed
>> like the exact right thing, but I can't seem to express some of the
>> most important queries we are currently running in our production
>> system.
>>
>> We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
>> WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
>> "2010-02-01" variety.  Currently in MySQL-land we've got a cron job to
>> pre-compute these aggregates (it checks modified timestamps and pulls
>> in only new records) and write them to a summary table.  I initially
>> believed I could use CouchDB's incremental map/reduce to effortlessly
>> build and update our "summary information" as it changes, but I'm
>> stuck.  I'm trying to relax, but I can't figure out exactly how :)
>>
>> In our example, our user "boris" listens to the same song many times
>> each month, and we're interested in the number of distinct songs he's
>> listened to during a specified time period (NOT the number of song
>> plays, but the number of distinct songs played).  In CouchDB it isn't
>> much trouble to get all of the unique songs that he's listened to
>> during a period.  Here's our document:
>>
>> {
>>  song_id: "happy birthday",
>>  user_id: "boris",
>>  date_played: [2011, 11, 14, 00, 12, 55],
>>  _id: ...
>> }
>>
>> To get the unique values, all we need to do is emit([doc.user_id,
>> doc.date_played, doc.song_id], null), reduce with _count, and query
>> with a startkey=["boris", "2011-01-01"]&endkey=["boris",
>> "2011-02-01"]&group_level=1.  This query will yield results like:
>>
>> ["boris", "happy birthday"], 20
>> ["boris", "yesterday"], 14
>> ...
>>
>> However, if our user has listened to 50,000 songs during the date
>> range, we'll get back 50,000 rows which seems expensive.  What I want
>> is just the scalar 50,000.  I've tried writing a reduce that returns
>> the set of distinct song_ids for each user (turning the values list
>> into a dictionary and back again), but CouchDB complains that I am not
>> reducing my values fast enough :-/  I'm also not sure how to reduce
>> this list to a scalar at the end without returning the whole thing to
>> my client (which defeats the purpose of all this anyways).
>>
>> Is this possible to do in CouchDB today?  If not, is it something that
>> is on the roadmap, or does the internal structure of CouchDB's b-tree
>> make this really hard to do?  It would of course be possible for me to
>> implement this myself (subscribe to the update notifications and
>> update my counts as appropriate in a custom script), but I wanted to
>> move to CouchDB so that I wouldn't have to do all this myself.
>>
>> Thanks for any advice!
>>
>> --Rob
>>
>

Mime
View raw message