Date: Mon, 16 Apr 2012 23:12:04 +0300
Subject: Re: Reduce just N rows?
From: Alon Keren <alon.keren@gmail.com>
To: user@couchdb.apache.org

On 16 April 2012 22:25, James Marca wrote:

> On Sun, Apr 15, 2012 at 12:00:38PM +0300, Alon Keren wrote:
> > On 15 April 2012 09:13, James Marca wrote:
> > >
> > > CouchDB will compute reduced values for what you select. If you just
> > > ask for values from A to B, it will *only* compute the reduced values
> > > over that range. So you can get "clever" with the key value, using
> > > something like
> > >
> > >     map: emit([user, game, trynumber], score);
> > >
> > > where trynumber is some value that is guaranteed to increase with
> > > each completed game score stored.
> > >
> > > Your reduce could use the built-in Erlang _sum.
> > >
> > > Then you can just request something like... hmm:
> > >
> > >     startkey=[user,game,BIGNUMBER]&descending=true&limit=10&reduce=false
> > >
> > > (where BIGNUMBER is something bigger than the highest try number of
> > > the game).
> > >
> > > This will give 10 values, and you can do the average lickety-split
> > > client-side, OR you can do one query to get the highest try number,
> > > then another to get between that game and ten back, to let Couch
> > > compute the sum for you.
> >
> > Thanks!
> >
> > I think a simpler alternative to 'trynumber' is the game's timestamp,
> > and BIGNUMBER could be replaced by '{}' (see:
> > http://wiki.apache.org/couchdb/View_collation).
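For concreteness, James's recipe (map to a [user, game, trynumber] key, then query the last 10 rows with descending order and no reduce, averaging client-side) can be sketched in plain JavaScript. The document fields (user, game, trynumber, score) and the sample data are my assumptions; the in-memory array stands in for the CouchDB view itself:

```javascript
// Sketch, not the list's exact code: simulate the view
//   map: emit([user, game, trynumber], score)
// and a client-side average of the 10 most recent tries, i.e. what
//   ?startkey=[user,game,BIGNUMBER]&descending=true&limit=10&reduce=false
// would hand back. Doc fields here are hypothetical.

const docs = [];
for (let t = 1; t <= 25; t++) {
  docs.push({ user: "alice", game: "chess", trynumber: t, score: t * 10 });
}

// The map function, as it would appear in a design document.
const rows = [];
function emit(key, value) { rows.push({ key, value }); }
function map(doc) { emit([doc.user, doc.game, doc.trynumber], doc.score); }
docs.forEach(map);

// View collation: rows sorted by key (same user/game here, so by trynumber).
rows.sort((a, b) => a.key[2] - b.key[2]);

// descending + limit=10 + reduce=false: the last 10 raw values.
const last10 = rows.slice(-10).map(r => r.value);

// The lickety-split client-side average.
const avg = last10.reduce((s, v) => s + v, 0) / last10.length;
console.log(avg); // 205 (average of scores 160..250)
```

A real client would issue the HTTP query instead of sorting locally; the point is that only 10 rows ever cross the wire.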
That's what I'm doing at
> > the moment :)
> >
> > Unfortunately, as the numbers of games and game-types grow, this would
> > become pretty demanding in CPU time and number of calls to Couch.
>
> I thought about timestamps first, but you said you wanted the last 10,
> and I wanted to be able to pipe the request through reduce.
>
> With timestamps you have to do two requests to get the current and 10
> prior, or a single request without reducing.
>
> At the risk of stating the obvious: if you ask for "limit=10" in a
> request *and* the request goes through reduce, you will get 10 reduced
> values, not 10 values that get reduced to one. By using an integer
> value, you can do the simple request I settled on above (give me ten
> values, no reduce), OR, in a real application, you probably know the
> current last game number, so you can pass start and end keys (the end
> key being ten less than the current game number) and force just 10
> results to get piped through reduce.

Ah, I think I see now what you're getting at - thanks for clarifying.

It seems to me that even with this approach, if I want to use the db's
reduce, I would have to make a separate query for each game type. Or am
I missing something?

> Also, I really don't think there is any load at all on the CPU with
> this approach. Or, to be more accurate, no more than any active
> database processing a view. Again, apologies for stating the obvious,
> but CouchDB does incremental updates of views, so if you keep adding
> data, it only processes the new data. Once you have processed the data
> into a view, querying it (without reduce) takes almost no CPU.
> Reducing it can be expensive if you do something in JavaScript, but
> isn't as expensive if you stick with the built-in native Erlang reduce
> functions (_sum, _count, etc.).

Reduces in CouchDB should be incremental, unlike when doing them outside
of Couch.

> But one thing to keep in mind is that you can probably use multiple
> databases.
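James's second variant above - when the app already knows the current last try number, bound the key range so exactly 10 rows go through the built-in _sum - can be sketched like this. The key layout and sample scores are carried over from earlier in the thread and are assumptions, not anything a real server returned:

```javascript
// Sketch of the "known last game number" variant: bound the key range so
// exactly 10 rows are piped through the built-in _sum reduce, i.e. the
// equivalent of
//   ?startkey=[user,game,16]&endkey=[user,game,25]&reduce=true
// Sample data is hypothetical; scores for tries 1..25 are 10, 20, ... 250.

const scores = Array.from({ length: 25 }, (_, i) => ({
  trynumber: i + 1,
  score: (i + 1) * 10,
}));

const last = 25;           // current last try number, known by the app
const startTry = last - 9; // ten tries back, inclusive

// What _sum would compute over just that key range on the server.
const sum = scores
  .filter(r => r.trynumber >= startTry && r.trynumber <= last)
  .reduce((s, r) => s + r.score, 0);

// One division client-side turns the reduced sum into the average.
const avg = sum / 10;
console.log(sum, avg); // 2050 205
```

The trade against the descending/no-reduce query is one extra request (or prior knowledge) to learn the current last try number, in exchange for letting the server do the summing incrementally.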
Is there any reason you *have* to put all the games and all
> the users in a single database? Can you have a database per game? Or a
> database per user? Then the views are only updated when a particular
> user is adding results and querying results.
>
> I do data collection from sensors with CouchDB. I use one database per
> sensor per year of data - roughly a thousand or so DBs per year. I do
> this so I can eventually spread the pain over multiple machines (I
> haven't really had to yet), and because Erlang does a really good job
> of maxing out a multicore machine if it has a lot of jobs to run. With
> just one database I was only getting two cores busy, but with
> thousands (when processing historical data) all 8 cores on my two
> servers were very busy.
>
> I also keep one database to do aggregations across all the detectors
> at the hourly level (I have 30-second data). Each db has a view that
> generates the hourly summaries I need, and I have a Node process that
> polls all the databases at the end of each day to collect the hourly
> documents and write them to the collation database, which has other
> views. Kind of a manual version of your chained map-reduce project
> (incarnate, right?), but it suits the data better than automating it.
>
> For your app, suppose there are a million users, all playing any of a
> thousand games. If every user posts a new score every second, ideally
> I would only want to make each player wait for their own data to get
> processed, not the data from the other 999,999 players. So that calls
> for a database per user. If users have to wait for Erlang to finish
> other jobs before it can schedule the user's job on a CPU, then you
> need more CPUs. With just one database you don't get that choice: you
> have to wait for all of the data to get processed (unless you allow
> stale views).

Actually, several users can participate in each game, but their scores
are individual.
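One small practical wrinkle with the database-per-user layout discussed above: CouchDB database names only allow lowercase letters, digits, and a few punctuation characters, so user IDs need sanitizing. A minimal sketch (the "scores_" prefix and the exact character whitelist are my assumptions, not anything from the thread):

```javascript
// Sketch: derive a per-user database name for the database-per-user
// layout. CouchDB db names must be lowercase and are limited to
// a-z, 0-9 and a few punctuation characters, and must start with a
// letter - hence the prefix and the whitelist replace. The naming
// scheme itself is hypothetical.
function userDbName(userId) {
  return "scores_" + String(userId)
    .toLowerCase()
    .replace(/[^a-z0-9_$()+-]/g, "_");
}

console.log(userDbName("Alice"));   // scores_alice
console.log(userDbName("bob#42"));  // scores_bob_42
```

With thousands of such databases, the per-user views then only rebuild when that one user writes a score, which is the scheduling win James describes.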
However, there should be enough user-specific data derived from these
games that it may be a good optimization down the line to put at least
this kind of data in user-specific databases.

> As with my app, across users you can have a separate collating
> database, with a process that queries each db for that user's last 10
> once every minute or so (the changes feed would probably work really
> well here... a change adds a callback to get data from that database
> when the periodic process runs) and updates the collating db with
> username_game_average-type documents, to get the user's standings
> compared to other players.
>
> Regards,
> james
>
> PS: sorry for the long reply. I've had too much coffee today.

Nothing to be sorry about - thanks a lot for giving it so much attention,
James!

Alon
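The collating step from James's last suggestion - periodically pull each user's last 10 scores and write a username_game_average document into a collation database - can be sketched like this. The in-memory objects stand in for real per-user CouchDB databases and HTTP calls, and the document naming is an assumption:

```javascript
// Sketch of the periodic collation pass: for each per-user "database",
// take the last 10 scores, average them, and upsert a
// username_game_average document into the collation "database".
// Everything here is in-memory; a real version would use the changes
// feed plus HTTP requests against CouchDB instead.

const userDbs = {
  alice: [100, 120, 140], // most recent score last; hypothetical data
  bob:   [80, 90],
};
const collationDb = {};   // stands in for the collating CouchDB database

function collate(game) {
  for (const [user, scores] of Object.entries(userDbs)) {
    const last10 = scores.slice(-10); // fewer than 10 if the user is new
    const avg = last10.reduce((s, v) => s + v, 0) / last10.length;
    // One doc per user/game, keyed the way James describes.
    collationDb[`${user}_${game}_average`] = { user, game, avg };
  }
}

collate("chess");
console.log(collationDb.alice_chess_average.avg); // 120
console.log(collationDb.bob_chess_average.avg);   // 85
```

A view over the collation db's avg values then gives cross-user standings without ever reducing across the per-user databases.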