From: Zachary Zolton
Date: Wed, 16 Nov 2011 15:40:18 -0600
Subject: Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?
To: user@couchdb.apache.org

Except that the OP specified the need to query for these counts within
a date range. So you have to collate by listening time, not the song.

On Wed, Nov 16, 2011 at 2:45 PM, Marcello Nuccio wrote:
> Hi Rob,
> I remember I've done something similar a while ago, but I cannot find
> the code and I don't have time to rewrite it right now... however, the
> trick is to only count when the song name changes. This works because
> view rows are sorted.
>
> HTH,
>   Marcello
>
> 2011/11/15 Rob Crowell :
>> Hello everyone,
>>
>> I'm writing some log-parsing code which is currently running on a
>> MySQL backend. We're doing a huge number of aggregates on this data
>> right now, but performance is suffering and I'm looking for
>> alternatives. The idea of incremental map/reduce initially seemed
>> like exactly the right thing, but I can't seem to express some of the
>> most important queries we are currently running in our production
>> system.
>>
>> We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
>> WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
>> "2010-02-01" variety. Currently in MySQL-land we've got a cron job to
>> pre-compute these aggregates (it checks modified timestamps and pulls
>> in only new records) and write them to a summary table. I initially
>> believed I could use CouchDB's incremental map/reduce to effortlessly
>> build and update our "summary information" as it changes, but I'm
>> stuck.
>> I'm trying to relax, but I can't figure out exactly how :)
>>
>> In our example, our user "boris" listens to the same song many times
>> each month, and we're interested in the number of distinct songs he's
>> listened to during a specified time period (NOT the number of song
>> plays, but the number of distinct songs played). In CouchDB it isn't
>> much trouble to get all of the unique songs that he's listened to
>> during a period. Here's our document:
>>
>> {
>>   song_id: "happy birthday",
>>   user_id: "boris",
>>   date_played: [2011, 11, 14, 00, 12, 55],
>>   _id: ...
>> }
>>
>> To get the unique values, all we need to do is emit([doc.user_id,
>> doc.date_played, doc.song_id], null), reduce with _count, and query
>> with startkey=["boris", "2011-01-01"]&endkey=["boris",
>> "2011-02-01"]&group_level=1. This query will yield results like:
>>
>> ["boris", "happy birthday"], 20
>> ["boris", "yesterday"], 14
>> ...
>>
>> However, if our user has listened to 50,000 songs during the date
>> range, we'll get back 50,000 rows, which seems expensive. What I want
>> is just the scalar 50,000. I've tried writing a reduce that returns
>> the set of distinct song_ids for each user (turning the values list
>> into a dictionary and back again), but CouchDB complains that I am not
>> reducing my values fast enough :-/ I'm also not sure how to reduce
>> this list to a scalar at the end without returning the whole thing to
>> my client (which defeats the purpose of all this anyway).
>>
>> Is this possible to do in CouchDB today? If not, is it something that
>> is on the roadmap, or does the internal structure of CouchDB's b-tree
>> make this really hard to do? It would of course be possible for me to
>> implement this myself (subscribe to the update notifications and
>> update my counts as appropriate in a custom script), but I wanted to
>> move to CouchDB so that I wouldn't have to do all this myself.
>>
>> Thanks for any advice!
>>
>> --Rob
>>
>
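Marcello's change-detection trick can be sketched as follows. This is a minimal, hypothetical stand-alone version of the logic that would normally live in a CouchDB _list function: in a real list function the signature is `function(head, req)` and `getRow()`/`send()` are provided by CouchDB, so here they are passed in as parameters just to keep the sketch runnable outside CouchDB. The function name and row shape are assumptions, not anything from the thread.

```javascript
// Sketch of the trick: view rows arrive sorted by key, so if the key is
// [user_id, song_id, date_played], every distinct song forms a contiguous
// run of rows, and distinct songs can be counted by watching for the song
// component of the key to change.
function countDistinctSongs(getRow, send) {
  let count = 0;
  let lastSong = null;
  let row;
  while ((row = getRow()) !== null) {
    const song = row.key[1]; // key = [user_id, song_id, date_played]
    if (song !== lastSong) { // a new contiguous run => one more distinct song
      count += 1;
      lastSong = song;
    }
  }
  send(JSON.stringify({ distinct_songs: count }));
}
```

Note Zachary's caveat, though: this relies on collating by song, so the key cannot lead with `date_played`, which is exactly what a date-range query needs.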
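For the date-range case Zachary raises — keys collated by listening time — a distinct count can still be computed client-side over the rows of a startkey/endkey slice, at the cost of transferring every row. A minimal sketch, assuming rows shaped like the `emit([doc.user_id, doc.date_played, doc.song_id], null)` view described in the thread; the function name is hypothetical:

```javascript
// With keys [user_id, date_played, song_id] the view can be sliced by a
// date range via startkey/endkey, but songs are no longer contiguous, so
// the distinct count is taken client-side with a Set over the fetched rows.
function distinctSongCount(rows) {
  const songs = new Set();
  for (const row of rows) {
    songs.add(row.key[2]); // key = [user_id, date_played, song_id]
  }
  return songs.size;
}
```

This yields the scalar the OP wants, but it still pulls all 50,000 rows to the client; building the same set inside a reduce function is what triggers the "not reducing your values fast enough" complaint mentioned in the thread.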