incubator-couchdb-user mailing list archives

From James Marca <jma...@translab.its.uci.edu>
Subject Re: Reduce just N rows?
Date Mon, 16 Apr 2012 19:25:25 GMT
On Sun, Apr 15, 2012 at 12:00:38PM +0300, Alon Keren wrote:
> On 15 April 2012 09:13, James Marca <jmarca@translab.its.uci.edu> wrote:
> 
> > CouchDB will compute reduced values for what you select.  If you just
> > ask for values from A to B, it will *only* compute the reduced values
> > over that range.  So you can get "clever" with the key value, using
> > something like
> >
> > map: emit( [user,game,trynumber], score);
> >
> > where trynumber is some value that is guaranteed to increase with each
> > completed game score stored.
> >
> > Your reduce could use the built-in Erlang _sum.
> >
> > Then you can just request something like...hmm
> >
> > startkey=[user,game,BIGNUMBER]&descending=true&limit=10&reduce=false
> > (where BIGNUMBER is something bigger than the highest try number of the game).
> >
> > This will give 10 values, and you can do the average lickety-split
> > client side, OR you can do one query to get the highest try number, then
> > another to get between that try and ten back, letting couch compute the
> > sum for you.
> >
> 
> Thanks!
> 
> I think a simpler alternative to 'trynumber' is the game's timestamp, and
> BIGNUMBER could be replaced by '{}' (see:
> http://wiki.apache.org/couchdb/View_collation). That's what I'm doing at
> the moment :)
> Unfortunately, as the numbers of games and game types grow, this would
> become pretty demanding in CPU time and in the number of calls to couch.
> 

I thought about timestamps first, but you said you wanted the last 10,
and I wanted to be able to pipe the request through reduce.

With timestamps you have to do two requests to get the current game and
the 10 prior, or a single request without reducing.

At the risk of stating the obvious, if you ask for "limit=10" in a
request, *and* the request goes through reduce, you will get 10
reduced values, not 10 values that get reduced to one.  By using an
integer value, you can do the simple request I settled on above (give
me ten values, no reduce), OR in a real application you probably know
the current last game number, so you can pass start and end keys (the
end key nine less than the current game number, for an inclusive range
of ten) and force just 10 results to get piped through reduce.
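
To make that concrete, here's roughly what the two requests could look
like against a view like the one above (the user/game names and design
doc path are made up, and the keys would need URL-escaping in
practice).  Last 10 scores, newest first, averaged client side:

    GET /scores/_design/games/_view/by_try
        ?startkey=["alice","chess",{}]&endkey=["alice","chess"]
        &descending=true&limit=10&reduce=false

Or, if you already know the current try number (say 57), let couch sum
the last ten:

    GET /scores/_design/games/_view/by_try
        ?startkey=["alice","chess",48]&endkey=["alice","chess",57]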

Also, I really don't think there is any load at all on the CPU with
this approach.  Or to be more accurate, no more than any active
database processing a view.  Again, apologies for stating the obvious,
but CouchDB does incremental updates of views, so if you keep adding
data, it only processes the new data.  Once you have processed the
data into a view, querying it (without reduce) takes almost no CPU.
Reducing it can be expensive if you do something in JavaScript, but
isn't as expensive if you stick with the built-in native Erlang reduce
functions (_sum, _count, etc.).
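
For example, a design document using the builtin reduce can be as
small as this (the doc and view names are just for illustration):

    {
      "_id": "_design/games",
      "views": {
        "by_try": {
          "map": "function(doc) { emit([doc.user, doc.game, doc.trynumber], doc.score); }",
          "reduce": "_sum"
        }
      }
    }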

But one thing to keep in mind is that you can probably use multiple
databases.  Is there any reason you *have* to put all the games and all
the users in a single database?  Can you have a database per game?  Or
a database per user?  Then the views are only updated when a
particular user is adding or querying results.
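
For instance (hypothetical names again), each user could get a small
database of their own, all sharing the same design doc:

    PUT /games_alice
    PUT /games_alice/_design/games    (same design doc as above)
    PUT /games_bob
    PUT /games_bob/_design/games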

I do data collection from sensors with CouchDB.  I use one database
per sensor per year of data, roughly a thousand or so DBs per year.  I
do this so I can eventually spread the pain across multiple machines
(I haven't really had to yet), and because Erlang does a really good
job of maxing out a multicore machine if it has a lot of jobs to run.
With just one database, I was only getting two cores busy, but with
thousands (when processing historical data) all 8 cores on my two
servers were very busy.

I also keep one database to do aggregations across all the detectors
at the hourly level (I have 30-second data).  Each db has a view that
generates the hourly summaries I need, and I have a node process that
polls all the databases at the end of each day to collect the hourly
documents and write them to the collation database, which has other
views.  It's kind of a manual version of your chained map-reduce
project (incarnate, right?), but it suits the data better than
automating it.
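
A stripped-down sketch of that poller, assuming the nano couchdb
module and made-up db/view names (error handling, scheduling, and
conflict handling on re-runs all omitted):

    var nano = require('nano')('http://localhost:5984');
    var collate = nano.use('hourly_collation');

    function collectHourly(dbname) {
      var db = nano.use(dbname);
      // group=true returns one reduced row per hourly key
      db.view('aggregates', 'hourly', { group: true }, function (err, body) {
        if (err) return console.error(dbname, err);
        var docs = body.rows.map(function (row) {
          // assumes the view key is an array like [year, month, day, hour]
          return { _id: dbname + ':' + row.key.join('-'), value: row.value };
        });
        // write the hourly summaries into the collation database
        collate.bulk({ docs: docs }, function (err) {
          if (err) console.error(dbname, err);
        });
      });
    }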

For your app, suppose there are a million users all playing any of a
thousand games.  If every user posts a new score every second, ideally
each player should only have to wait for their own data to get
processed, not the data from the other 999,999 players.  So that calls
for a database per user.  If users have to wait for Erlang to finish
other jobs before it can schedule their job on a CPU, then you need
more CPUs.  With just one database you don't get that choice; you have
to wait for all of the data to get processed (unless you allow stale
views).
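
For what it's worth, allowing staleness is a single query parameter:
something like

    GET /scores/_design/games/_view/by_try?stale=ok

returns whatever index was last built, without waiting for new
documents to be processed into the view.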

As with my app, across users you can have a separate database: a
process queries each user's db for that user's last 10 once every
minute or so (the changes feed would probably work really well
here... a change flags that database for a data pull the next time
the periodic process runs) and updates a collating db with
username_game_average type documents, to get each user's standings
compared to other players.
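
Continuing the nano sketch from above, the flagging could be a
one-shot changes check per user db (lastSeq would have to be persisted
between runs; names are again made up):

    var flagged = {};

    function checkDb(dbname, lastSeq) {
      nano.db.changes(dbname, { since: lastSeq }, function (err, body) {
        if (err) return console.error(dbname, err);
        // anything new since the last pass? queue this db for a re-query
        if (body.results.length > 0) flagged[dbname] = body.last_seq;
      });
    }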

Regards,
james

PS, sorry for the long reply. I've had too much coffee today.

