Date: Mon, 16 Apr 2012 23:12:04 +0300
Subject: Re: Reduce just N rows?
From: Alon Keren <alon.keren@gmail.com>
To: user@couchdb.apache.org

On 16 April 2012 22:25, James Marca wrote:

> On Sun, Apr 15, 2012 at 12:00:38PM +0300, Alon Keren wrote:
> > On 15 April 2012 09:13, James Marca wrote:
> > >
> > > CouchDB will compute reduced values for what you select. If you just
> > > ask for values from A to B, it will *only* compute the reduced values
> > > over that range. So you can get "clever" with the key value, using
> > > something like
> > >
> > >     map: emit([user, game, trynumber], score);
> > >
> > > where trynumber is some value that is guaranteed to increase with
> > > each completed game score stored.
> > >
> > > Your reduce could use the built-in Erlang _sum.
> > >
> > > Then you can just request something like... hmm:
> > >
> > >     startkey=[user,game,BIGNUMBER]&descending=true&limit=10&reduce=false
> > >
> > > (where BIGNUMBER is something bigger than the highest try number of
> > > the game).
> > >
> > > This will give 10 values, and you can do the average lickety-split
> > > client-side, OR you can do one query to get the highest try number,
> > > then another to get between that game and ten back, to let Couch
> > > compute the sum for you.
> >
> > Thanks!
> >
> > I think a simpler alternative to 'trynumber' is the game's timestamp,
> > and BIGNUMBER could be replaced by '{}' (see:
> > http://wiki.apache.org/couchdb/View_collation).
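For concreteness, James's recipe (map to a [user, game, trynumber] key, then query the last 10 rows with descending order and no reduce, averaging client-side) can be sketched in plain JavaScript. The document fields (user, game, trynumber, score) and the sample data are my assumptions; the in-memory array stands in for the CouchDB view itself:

```javascript
// Sketch, not the list's exact code: simulate the view
//   map: emit([user, game, trynumber], score)
// and a client-side average of the 10 most recent tries, i.e. what
//   ?startkey=[user,game,BIGNUMBER]&descending=true&limit=10&reduce=false
// would hand back. Doc fields here are hypothetical.

const docs = [];
for (let t = 1; t <= 25; t++) {
  docs.push({ user: "alice", game: "chess", trynumber: t, score: t * 10 });
}

// The map function, as it would appear in a design document.
const rows = [];
function emit(key, value) { rows.push({ key, value }); }
function map(doc) { emit([doc.user, doc.game, doc.trynumber], doc.score); }
docs.forEach(map);

// View collation: rows sorted by key (same user/game here, so by trynumber).
rows.sort((a, b) => a.key[2] - b.key[2]);

// descending + limit=10 + reduce=false: the last 10 raw values.
const last10 = rows.slice(-10).map(r => r.value);

// The lickety-split client-side average.
const avg = last10.reduce((s, v) => s + v, 0) / last10.length;
console.log(avg); // 205 (average of scores 160..250)
```

A real client would issue the HTTP query instead of sorting locally; the point is that only 10 rows ever cross the wire.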
That's what I'm doing at
> > the moment :)
> >
> > Unfortunately, as the numbers of games and game-types grow, this would
> > become pretty demanding in CPU time and number of calls to Couch.
>
> I thought about timestamps first, but you said you wanted the last 10,
> and I wanted to be able to pipe the request through reduce.
>
> With timestamps you have to do two requests to get the current and 10
> prior, or a single request without reducing.
>
> At the risk of stating the obvious: if you ask for "limit=10" in a
> request *and* the request goes through reduce, you will get 10 reduced
> values, not 10 values that get reduced to one. By using an integer
> value, you can do the simple request I settled on above (give me ten
> values, no reduce), OR, in a real application, you probably know the
> current last game number, so you can pass start and end keys (the end
> key being ten less than the current game number) and force just 10
> results to get piped through reduce.

Ah, I think I see now what you're getting at - thanks for clarifying.

It seems to me that even with this approach, if I want to use the db's
reduce, I would have to make a separate query for each game type. Or am
I missing something?

> Also, I really don't think there is any load at all on the CPU with
> this approach. Or, to be more accurate, no more than any active
> database processing a view. Again, apologies for stating the obvious,
> but CouchDB does incremental updates of views, so if you keep adding
> data, it only processes the new data. Once you have processed the data
> into a view, querying it (without reduce) takes almost no CPU.
> Reducing it can be expensive if you do something in JavaScript, but
> isn't as expensive if you stick with the built-in native Erlang reduce
> functions (_sum, _count, etc.).

Reduces in CouchDB should be incremental, unlike when doing them outside
of Couch.

> But one thing to keep in mind is that you can probably use multiple
> databases.
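James's second variant above - when the app already knows the current last try number, bound the key range so exactly 10 rows go through the built-in _sum - can be sketched like this. The key layout and sample scores are carried over from earlier in the thread and are assumptions, not anything a real server returned:

```javascript
// Sketch of the "known last game number" variant: bound the key range so
// exactly 10 rows are piped through the built-in _sum reduce, i.e. the
// equivalent of
//   ?startkey=[user,game,16]&endkey=[user,game,25]&reduce=true
// Sample data is hypothetical; scores for tries 1..25 are 10, 20, ... 250.

const scores = Array.from({ length: 25 }, (_, i) => ({
  trynumber: i + 1,
  score: (i + 1) * 10,
}));

const last = 25;           // current last try number, known by the app
const startTry = last - 9; // ten tries back, inclusive

// What _sum would compute over just that key range on the server.
const sum = scores
  .filter(r => r.trynumber >= startTry && r.trynumber <= last)
  .reduce((s, r) => s + r.score, 0);

// One division client-side turns the reduced sum into the average.
const avg = sum / 10;
console.log(sum, avg); // 2050 205
```

The trade against the descending/no-reduce query is one extra request (or prior knowledge) to learn the current last try number, in exchange for letting the server do the summing incrementally.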
Is there any reason you *have* to put all the games and all
> the users in a single database? Can you have a database per game? Or a
> database per user? Then the views are only updated when a particular
> user is adding results and querying results.
>
> I do data collection from sensors with CouchDB. I use one database per
> sensor per year of data - roughly a thousand or so DBs per year. I do
> this so I can eventually spread the pain over multiple machines (I
> haven't really had to yet), and because Erlang does a really good job
> of maxing out a multicore machine if it has a lot of jobs to run. With
> just one database I was only getting two cores busy, but with
> thousands (when processing historical data) all 8 cores on my two
> servers were very busy.
>
> I also keep one database to do aggregations across all the detectors
> at the hourly level (I have 30-second data). Each db has a view that
> generates the hourly summaries I need, and I have a Node process that
> polls all the databases at the end of each day to collect the hourly
> documents and write them to the collation database, which has other
> views. Kind of a manual version of your chained map-reduce project
> (incarnate, right?), but it suits the data better than automating it.
>
> For your app, suppose there are a million users, all playing any of a
> thousand games. If every user posts a new score every second, ideally
> I would only want to make each player wait for their own data to get
> processed, not the data from the other 999,999 players. So that calls
> for a database per user. If users have to wait for Erlang to finish
> other jobs before it can schedule the user's job on a CPU, then you
> need more CPUs. With just one database you don't get that choice: you
> have to wait for all of the data to get processed (unless you allow
> stale views).

Actually, several users can participate in each game, but their scores
are individual.
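One small practical wrinkle with the database-per-user layout discussed above: CouchDB database names only allow lowercase letters, digits, and a few punctuation characters, so user IDs need sanitizing. A minimal sketch (the "scores_" prefix and the exact character whitelist are my assumptions, not anything from the thread):

```javascript
// Sketch: derive a per-user database name for the database-per-user
// layout. CouchDB db names must be lowercase and are limited to
// a-z, 0-9 and a few punctuation characters, and must start with a
// letter - hence the prefix and the whitelist replace. The naming
// scheme itself is hypothetical.
function userDbName(userId) {
  return "scores_" + String(userId)
    .toLowerCase()
    .replace(/[^a-z0-9_$()+-]/g, "_");
}

console.log(userDbName("Alice"));   // scores_alice
console.log(userDbName("bob#42"));  // scores_bob_42
```

With thousands of such databases, the per-user views then only rebuild when that one user writes a score, which is the scheduling win James describes.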
However, there should be enough user-specific data derived from these
games that it may be a good optimization down the line to put at least
this kind of data in user-specific databases.

> As with my app, across users you can have a separate collating
> database, with a process that queries each db for that user's last 10
> once every minute or so (the changes feed would probably work really
> well here... a change adds a callback to get data from that database
> when the periodic process runs) and updates the collating db with
> username_game_average-type documents, to get the user's standings
> compared to other players.
>
> Regards,
> james
>
> PS: sorry for the long reply. I've had too much coffee today.

Nothing to be sorry about - thanks a lot for giving it so much attention,
James!

Alon
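The collating step from James's last suggestion - periodically pull each user's last 10 scores and write a username_game_average document into a collation database - can be sketched like this. The in-memory objects stand in for real per-user CouchDB databases and HTTP calls, and the document naming is an assumption:

```javascript
// Sketch of the periodic collation pass: for each per-user "database",
// take the last 10 scores, average them, and upsert a
// username_game_average document into the collation "database".
// Everything here is in-memory; a real version would use the changes
// feed plus HTTP requests against CouchDB instead.

const userDbs = {
  alice: [100, 120, 140], // most recent score last; hypothetical data
  bob:   [80, 90],
};
const collationDb = {};   // stands in for the collating CouchDB database

function collate(game) {
  for (const [user, scores] of Object.entries(userDbs)) {
    const last10 = scores.slice(-10); // fewer than 10 if the user is new
    const avg = last10.reduce((s, v) => s + v, 0) / last10.length;
    // One doc per user/game, keyed the way James describes.
    collationDb[`${user}_${game}_average`] = { user, game, avg };
  }
}

collate("chess");
console.log(collationDb.alice_chess_average.avg); // 120
console.log(collationDb.bob_chess_average.avg);   // 85
```

A view over the collation db's avg values then gives cross-user standings without ever reducing across the per-user databases.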