From user-return-18867-apmail-couchdb-user-archive=couchdb.apache.org@couchdb.apache.org Wed Nov 16 20:45:51 2011 Return-Path: X-Original-To: apmail-couchdb-user-archive@www.apache.org Delivered-To: apmail-couchdb-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8FDE99DB2 for ; Wed, 16 Nov 2011 20:45:51 +0000 (UTC) Received: (qmail 23999 invoked by uid 500); 16 Nov 2011 20:45:50 -0000 Delivered-To: apmail-couchdb-user-archive@couchdb.apache.org Received: (qmail 23967 invoked by uid 500); 16 Nov 2011 20:45:49 -0000 Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@couchdb.apache.org Delivered-To: mailing list user@couchdb.apache.org Received: (qmail 23958 invoked by uid 99); 16 Nov 2011 20:45:49 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 16 Nov 2011 20:45:49 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of marcello.nuccio@gmail.com designates 209.85.215.180 as permitted sender) Received: from [209.85.215.180] (HELO mail-ey0-f180.google.com) (209.85.215.180) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 16 Nov 2011 20:45:42 +0000 Received: by eyg5 with SMTP id 5so1272759eyg.11 for ; Wed, 16 Nov 2011 12:45:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; bh=Bpe2871W9ZfBCjMUw6r0ziCwkyBe7HKPgmIYFECWIjU=; b=JEGPcm48pf/FpMSHeoZZ32lCorxZc72xzulkf0W4AUDEpsozDdK63XvQ7s6h5s5sdH CZ61Se+zteVkuh0GseOGRmWNkt9pZbZKAxzyF1rFxVOwqbotTJLZihj1p823fKP0829l ApFBoG43HO4JfO1EORH7oqN3aRGduKeZbYIpo= MIME-Version: 1.0 Received: by 10.224.197.202 with SMTP id el10mr5974380qab.39.1321476321400; Wed, 16 Nov 2011 12:45:21 -0800 (PST) Received: by 10.229.249.7 with HTTP; Wed, 16 Nov 2011 12:45:21 -0800 (PST) In-Reply-To: References: Date: Wed, 16 Nov 2011 21:45:21 +0100 Message-ID: Subject: Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB? From: Marcello Nuccio To: user@couchdb.apache.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org Hi Rob, I remember I've done something similar a while ago, but I cannot find the code right now and I don't have time to rewrite it right now... however the trick is to only count when the song name changes. This works because view rows are sorted. HTH, Marcello 2011/11/15 Rob Crowell : > Hello everyone, > > I'm writing some log-parsing code which is currently running on a > MySQL backend. =C2=A0We're doing a huge amount of aggregates on this data > right now, but performance is suffering and I'm looking for > alternatives. =C2=A0The idea of incremental map/reduce initially seemed > like the exact right thing, but I can't seem to express some of the > most important queries we are currently running in our production > system. > > We're running a lot of queries of the SELECT COUNT(DISTINCT song_id) > WHERE user_id =3D "boris" AND created >=3D "2010-01-01" AND created < > "2010-02-01" variety. =C2=A0Currently in MySQL-land we've got a cron job = to > pre-compute these aggregates (it checks modified timestamps and pulls > in only new records) and write them to a summary table. =C2=A0I initially > believed I could use CouchDB's incremental map/reduce to effortlessly > build and update our "summary information" as it changes, but I'm > stuck. =C2=A0I'm trying to relax, but I can't figure out exactly how :) > > In our example, our user "boris" listens to the same song many times > each month, and we're interested in the number of distinct songs he's > listened to during a specified time period (NOT the number of song > plays, but the number of distinct songs played). =C2=A0In CouchDB it isn'= t > much trouble to get all of the unique songs that he's listened to > during a period. =C2=A0Here's our document: > > { > =C2=A0song_id: "happy birthday", > =C2=A0user_id: "boris", > =C2=A0date_played: [2011, 11, 14, 00, 12, 55], > =C2=A0_id: ... > } > > To get the unique values, all we need to do is emit([doc.user_id, > doc.date_played, doc.song_id], null), reduce with _count, and query > with a startkey=3D["boris", "2011-01-01"]&endkey=3D["boris", > "2011-02-01"]&group_level=3D1. =C2=A0This query will yield results like: > > ["boris", "happy birthday"], 20 > ["boris", "yesterday"], 14 > ... > > However, if our user has listened to 50,000 songs during the date > range, we'll get back 50,000 rows which seems expensive. =C2=A0What I wan= t > is just the scalar 50,000. =C2=A0I've tried writing a reduce that returns > the set of distinct song_ids for each user (turning the values list > into a dictionary and back again), but CouchDB complains that I am not > reducing my values fast enough :-/ =C2=A0I'm also not sure how to reduce > this list to a scalar at the end without returning the whole thing to > my client (which defeats the purpose of all this anyways). > > Is this possible to do in CouchDB today? =C2=A0If not, is it something th= at > is on the roadmap, or does the internal structure of CouchDB's b-tree > make this really hard to do? =C2=A0It would of course be possible for me = to > implement this myself (subscribe to the update notifications and > update my counts as appropriate in a custom script), but I wanted to > move to CouchDB so that I wouldn't have to do all this myself. > > Thanks for any advice! > > --Rob >