incubator-couchdb-user mailing list archives

From Brian Candler <B.Cand...@pobox.com>
Subject Re: 'Grouping' documents so that a set of documents is passed to the view function
Date Thu, 25 Jun 2009 08:34:31 GMT
On Thu, Jun 25, 2009 at 09:24:31AM +0800, hhsuper wrote:
>    Let me describe the application scenario carefully: when a user learns
>    from a dialog, they start a session (sessionid). Studying each line of
>    the dialog generates a CouchDB document (with uid/dialogid/sessionid and
>    wordcount/weightedScore/grade for the line). The user can re-study
>    the same dialog some days later, which starts a new session for
>    the same dialog. We want every user's average grade from their
>    study results (with the dialog as the unit, so we need to sum over a
>    given session), but for each dialog we only want to use the session
>    with the highest grade, not all sessions.

This sounds exactly like the sort of logic which should be implemented in
the client. Download all the user's results, process them, display them.

>    This seems too difficult to implement with one view. In an RDBMS we
>    would need to build the SQL query on a subquery (or on a db view). Is it
>    appropriate to implement this with a CouchDB view?

I don't really understand why you need a subquery in an RDBMS. I would just
select all results where uid=x, and process them as required (for example:
build a hash of dialogid=>bestScore and update it from each received row).
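
A minimal sketch of that client-side processing, assuming the rows for one
uid have already been fetched. The row shape and the function name
bestGradePerDialog are hypothetical and only serve as illustration. Since
the grade is wanted per session, the sketch first totals each session and
then keeps the best session grade per dialog:

```javascript
// Hypothetical row shape (an assumption, not from the original post):
// { dialogid, sessionid, wordCount, weightedScore }, one row per studied line.
function bestGradePerDialog(rows) {
  var sessions = {}; // JSON key of [dialogid, sessionid] -> running totals
  rows.forEach(function (r) {
    var key = JSON.stringify([r.dialogid, r.sessionid]);
    if (!sessions[key]) {
      sessions[key] = { dialogid: r.dialogid, wordCount: 0, weightedScore: 0 };
    }
    sessions[key].wordCount += r.wordCount;
    sessions[key].weightedScore += r.weightedScore;
  });

  var best = {}; // dialogid -> highest session grade seen so far
  Object.keys(sessions).forEach(function (key) {
    var s = sessions[key];
    var grade = s.weightedScore / s.wordCount;
    if (!(s.dialogid in best) || grade > best[s.dialogid]) {
      best[s.dialogid] = grade;
    }
  });
  return best;
}

// Made-up rows for a single user: two sessions on d1, one session on d2.
var rows = [
  { dialogid: "d1", sessionid: "s1", wordCount: 10, weightedScore: 50 },
  { dialogid: "d1", sessionid: "s1", wordCount: 10, weightedScore: 70 },
  { dialogid: "d1", sessionid: "s2", wordCount: 10, weightedScore: 80 },
  { dialogid: "d2", sessionid: "s3", wordCount: 5, weightedScore: 20 }
];
var best = bestGradePerDialog(rows);
console.log(best); // { d1: 8, d2: 4 }
```

The user's average can then be computed from the values of the returned
hash, however that average is meant to be weighted.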

>    The function code is below (it may already be complex). By the way,
>    when you said "root node with *all* the uids": I don't clearly
>    understand the view's internal storage structure, and I can't find it
>    in the wiki:

There is good info in this blog posting (it should probably be linked from
the wiki):
http://horicky.blogspot.com/2008/10/couchdb-implementation.html

A reduce value can be any JavaScript value (number, string, array or
object). Some people have tried to build summary objects of the form
   {
    uid1: [some data],
    uid2: [some more data],
    uid3: [even more data],
    ...
   }

The problem here is that if you have a million uids, the reduce value will
be an object with a million members. And the reduce value is *stored* in the
root btree node. In fact, every btree node stores the reduce value for the
documents in that node and its children, so these huge objects have to be
stored and recombined throughout the tree. This means reduce will become
ridiculously slow.

However, in your reduce function, you are always reducing to a single object
with three members, which is fine. The reduce value in the root node will be
a reduce calculated across all users, which may or may not mean anything,
but doesn't do any harm either.

>    function(keys, values, rereduce) {
>      var wordCount = 0;
>      var weightedScore = 0;
>      if( !rereduce ) {
>        // This is the reduce phase, we are reducing over emitted values
>        // from the map functions.
>        var sessions = {};
>        for(var k in keys){
>            // calculate the total value for every session (a session
>            // contains multiple sessiondialog <=> couchdb documents)
>            var key = keys[k][0];
>            key = key?key.join('_'):key;
>            if (!sessions[key]) {
>                sessions[key] = values[k];
>            }else{
>                sessions[key].wordCount += values[k].wordCount;
>                sessions[key].weightedScore += values[k].weightedScore;
>                sessions[key].grade =
>                    sessions[key].weightedScore/sessions[key].wordCount;
>            }
>        }
>        // calculate the top session for each dialog
>        var dialogsessions = {};
>        for(var sk in sessions){
>            var dialogId = sk?sk.split('_')[1]:sk;
>            if(!dialogsessions[dialogId]){
>                dialogsessions[dialogId] = sessions[sk];
>            }else if(dialogsessions[dialogId].grade < sessions[sk].grade){
>                dialogsessions[dialogId] = sessions[sk];
>            }
>        }
>        // calculate the result
>        for(var ds in dialogsessions){
>            wordCount += dialogsessions[ds].wordCount;
>            weightedScore += dialogsessions[ds].weightedScore;
>        }
>      } else {
>        // This is the rereduce phase, we are re-reducing previously
>        // reduced values.
>        for(var i in values) {
>          wordCount += values[i].wordCount;
>          weightedScore += values[i].weightedScore;
>        }
>      }
>      return {"wordCount"    : wordCount,
>              "weightedScore"    : weightedScore,
>              "grade" : weightedScore/wordCount
>         };
>    }

This looks wrong to me. If the re-reduce function only has to sum wordCount
and weightedScore, why isn't such simple logic also in the reduce function?

Your reduce function is non-linear. In particular, it searches for the
maximum value of something. This seems incompatible with a linear re-reduce
function. It's certainly fine to have reduce and re-reduce functions which
calculate maxima, but I would expect "find maximum" logic in both reduce and
re-reduce.
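
As an illustration of what "find maximum" in both phases looks like, here is
a minimal sketch. It assumes a map function that emits a plain numeric grade
as the value, which is a simplification and not the document layout from the
quoted code. The name reduceMax is made up for this example:

```javascript
// Sketch: the same "find maximum" logic serves reduce and rereduce,
// because the maximum of partial maxima is the overall maximum.
// Assumes every value is a plain number (an assumption for illustration).
function reduceMax(keys, values, rereduce) {
  var max = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i] > max) {
      max = values[i];
    }
  }
  return max;
}

// Reducing everything at once and reducing in two chunks agree.
var allAtOnce = reduceMax(null, [3, 9, 4], false);
var chunked = reduceMax(null, [
  reduceMax(null, [3], false),
  reduceMax(null, [9, 4], false)
], true);
console.log(allAtOnce, chunked); // 9 9
```

This only covers the maximum itself. It does not address summing each
session before the comparison, which is where the quoted function gets into
trouble.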

With your logic, my suspicion is that you will get different answers
depending on exactly how the input documents are divided between the reduce
and re-reduce phases. Since you have no control over that, it means your
answers will be wrong sometimes.

Perhaps it will help you to understand this if you consider the limiting
case where exactly one document is fed into the 'reduce' function at a time,
and then the outputs of the reduce functions are combined with a large
re-reduce phase. (Also consider what happens if exactly one document at a
time is fed into the 'reduce' function, and then pairs of reduce values are
fed into 'rereduce' forming a binary tree)
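
The first limiting case can be tried outside CouchDB. The sketch below uses
a condensed copy of the quoted reduce function (same logic, shortened, and
copying the input values instead of aliasing them). It assumes the emitted
key is [uid, dialogid, sessionid], which is what the join/split on '_' in
the quoted code suggests. The documents and doc ids are made up:

```javascript
// Condensed copy of the quoted reduce function, for experimentation only.
function reduce(keys, values, rereduce) {
  var wordCount = 0, weightedScore = 0;
  if (!rereduce) {
    var sessions = {};
    for (var k = 0; k < keys.length; k++) {
      var key = keys[k][0].join('_'); // assumed key: [uid, dialogid, sessionid]
      if (!sessions[key]) {
        sessions[key] = {
          wordCount: values[k].wordCount,
          weightedScore: values[k].weightedScore,
          grade: values[k].grade
        };
      } else {
        sessions[key].wordCount += values[k].wordCount;
        sessions[key].weightedScore += values[k].weightedScore;
        sessions[key].grade = sessions[key].weightedScore / sessions[key].wordCount;
      }
    }
    var top = {}; // dialogid -> session with the highest grade
    for (var sk in sessions) {
      var dialogId = sk.split('_')[1];
      if (!top[dialogId] || top[dialogId].grade < sessions[sk].grade) {
        top[dialogId] = sessions[sk];
      }
    }
    for (var d in top) {
      wordCount += top[d].wordCount;
      weightedScore += top[d].weightedScore;
    }
  } else {
    for (var i = 0; i < values.length; i++) {
      wordCount += values[i].wordCount;
      weightedScore += values[i].weightedScore;
    }
  }
  return { wordCount: wordCount, weightedScore: weightedScore,
           grade: weightedScore / wordCount };
}

// One user, one dialog, two sessions with one line each (made-up data).
var keys = [[["u1", "d1", "s1"], "doc1"], [["u1", "d1", "s2"], "doc2"]];
var values = [
  { wordCount: 10, weightedScore: 50, grade: 5 },
  { wordCount: 10, weightedScore: 80, grade: 8 }
];

// Case A: both documents reach 'reduce' together.
var onePass = reduce(keys, values, false);

// Case B: one document per 'reduce' call, then a single rereduce.
var partials = keys.map(function (key, i) {
  return reduce([key], [values[i]], false);
});
var split = reduce(null, partials, true);

console.log(onePass.grade, split.grade); // 8 6.5
```

In case A only the better session counts, so the grade is 8. In case B each
session survives its own reduce call and the rereduce simply adds them up,
giving 130/20 = 6.5.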

I wouldn't be surprised if there's a constant you can tweak somewhere in the
CouchDB source code which would let you actually get this behaviour in
practice.

Also, someone wrote a JavaScript implementation of map/reduce which would
let you play with this interactively. This is linked from the very bottom of
http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views

HTH,

Brian.
