incubator-couchdb-user mailing list archives

From J Chris Anderson <jch...@couch.io>
Subject Re: Map reduce and weird output question
Date Mon, 08 Mar 2010 20:13:28 GMT

On Mar 8, 2010, at 12:28 AM, Gregory Tappero wrote:

> Thanks,
> 
> I got the wanted result with
> http://friendpaste.com/6sYxT4cNJ9IjpWiW9qgCut
> 
> benoitc came to my rescue.
> 

This will be a problem with large databases. When the number of unique users is large, the group=false
query would have to return a very large object with all the usernames in it. In practice it won't, because
it will raise a reduce_overflow_error.
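
To make the growth concrete, here is a minimal sketch of the kind of reduce that runs into this. It is an assumption for illustration, not the code from the paste linked above: collectNames and outputSize are hypothetical names, and the keys are made-up [year, month, day, username] arrays.

```javascript
// Hypothetical reduce, NOT the code from the paste linked above: it returns
// an object holding every username it has seen, so its output grows with
// the number of distinct users instead of shrinking.
function collectNames(keys, values, rereduce) {
  var seen = {};
  if (rereduce) {
    // values are the objects returned by earlier reduce calls
    values.forEach(function (partial) {
      Object.keys(partial).forEach(function (name) {
        seen[name] = true;
      });
    });
  } else {
    // keys are [emittedKey, docId] pairs; the username is assumed to be
    // the last element of the emitted key
    keys.forEach(function (pair) {
      var key = pair[0];
      seen[key[key.length - 1]] = true;
    });
  }
  return seen;
}

// Size, in characters, of the serialized reduce output for userCount users
function outputSize(userCount) {
  var keys = [];
  var values = [];
  for (var i = 0; i < userCount; i++) {
    keys.push([[2010, 3, 6, "user" + i], "doc" + i]);
    values.push(1);
  }
  return JSON.stringify(collectNames(keys, values, false)).length;
}

var smallOutput = outputSize(10);
var largeOutput = outputSize(1000);
console.log(smallOutput, largeOutput);
```

The serialized output grows roughly linearly with the number of distinct usernames, which is the pattern the reduce overflow check is there to catch.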

Your problem is interesting. You might learn from reading this paper:

http://labs.google.com/papers/sawzall.html

It gives a survey of the available algorithms which can work in constant space even over large
databases.
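
One simple way to keep the reduce output constant-size, sketched under assumptions not confirmed in this thread: keep the username in the key (as in the earlier map that emits [year, month, day, em_uname]), reduce with _count so every reduce value is a single number, query with group_level=4, and count the rows per day on the client. The helper name countUniqueUsersPerDay is hypothetical; the sample rows mirror the grouped output quoted below.

```javascript
// Hypothetical client-side helper: takes the "rows" array of a view response
// queried with group_level=4 (keys are [year, month, day, username]) and
// returns one entry per day with the number of distinct usernames.
function countUniqueUsersPerDay(rows) {
  var counts = {}; // "year-month-day" -> number of distinct usernames
  var order = [];  // days in first-seen order
  rows.forEach(function (row) {
    var day = row.key.slice(0, 3);
    var id = day.join("-");
    if (!(id in counts)) {
      counts[id] = 0;
      order.push(day);
    }
    // At group_level=4 each [day, username] pair appears exactly once,
    // so every row contributes one distinct user to its day.
    counts[id] += 1;
  });
  return order.map(function (day) {
    return { key: day, value: counts[day.join("-")] };
  });
}

// Sample rows copied from the grouped output quoted in this thread
var rows = [
  { key: [2010, 3, 3, "student1"], value: 5 },
  { key: [2010, 3, 4, "student1"], value: 18 },
  { key: [2010, 3, 5, "eong"], value: 77 },
  { key: [2010, 3, 6, "bkante"], value: 71 },
  { key: [2010, 3, 6, "jfrancillette"], value: 72 },
  { key: [2010, 3, 6, "mlouviers"], value: 12 },
  { key: [2010, 3, 7, "student1"], value: 4 }
];
var perDay = countUniqueUsersPerDay(rows);
console.log(JSON.stringify(perDay));
```

The cost moves to the client, which reads one row per (day, user) pair, so it helps to restrict the query to the days of interest, for example with startkey=[2010,3,6] and endkey=[2010,3,6,{}].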

Chris

> Greg
> 
> 
> 
> 
> On Mon, Mar 8, 2010 at 9:07 AM, Paweł Stawicki <pawelstawicki@gmail.com> wrote:
>> Hmm... I'm just thinking now, don't know if it works, but maybe try
>> something like this:
>> If you can get number of documents per day per username, first try to make
>> this number always one if keys is [date, username]:
>> Reduce:
>> if (keys.length == 2) {
>>  return 1;
>> } else if (keys.length == 1) { //date only, return number of usernames
>>  return values.length;
>> }
>> 
>> The risk is that some usernames will count twice, but maybe try it.
>> 
>> Best regards
>> --
>> Paweł Stawicki
>> http://pawelstawicki.blogspot.com
>> http://szczecin.jug.pl
>> 
>> 
>> 
>> On Mon, Mar 8, 2010 at 08:03, Gregory Tappero <coulix@gmail.com> wrote:
>> 
>>> My number of keys is 4 (year, month, day, username), so returning the number of
>>> keys in reduce does not seem to give me the output I am looking for.
>>> Unless i misunderstood something.
>>> 
>>> Thank you for helping,
>>> 
>>> Greg
>>> 
>>> On Mon, Mar 8, 2010 at 12:28 AM, Randall Leeds <randall.leeds@gmail.com>
>>> wrote:
>>>> I'm not an expert on this, but I think you need to create your own
>>>> reduce function and output the number of keys rather than the sum of
>>>> the values.
>>>> 
>>>> On Sun, Mar 7, 2010 at 15:15, Gregory Tappero <coulix@gmail.com> wrote:
>>>>> Thank you Pawel,
>>>>> 
>>>>> If i try to follow your way it gives me the count of docs in a given
>>>>> day for each username, what i would like is the count of unique
>>>>> usernames for a given day.
>>>>> 
>>>>> function(doc) {
>>>>> 
>>>>>    if (doc.doc_type=="EdoPing" && doc.em_type==0) {
>>>>>        date = new Date().setRFC3339(doc.created_at);
>>>>>        emit([date.getFullYear(), parseInt(date.getMonth())+1,
>>>>> date.getDate(), doc.em_uname] , 1);
>>>>> 
>>>>>    }
>>>>> }
>>>>> 
>>>>> Reduce:
>>>>>  _count
>>>>> 
>>>>> =================
>>>>> I get:
>>>>> 
>>>>> [2010, 3, 3, "student1"]         5
>>>>> [2010, 3, 4, "student1"]         18
>>>>> [2010, 3, 5, "eong"]             77
>>>>> [2010, 3, 6, "bkante"]           71
>>>>> [2010, 3, 6, "jfrancillette"]    72
>>>>> [2010, 3, 6, "mlouviers"]        12
>>>>> [2010, 3, 7, "student1"]         4
>>>>> 
>>>>> I would like to extract the following
>>>>> 
>>>>> [2010, 3, 3]       1
>>>>> [2010, 3, 4]       1
>>>>> [2010, 3, 5]    1
>>>>> [2010, 3, 6]       3
>>>>> [2010, 3, 7]       1
>>>>> 
>>>>> 
>>>>> If I query with group_level=3, it sums the values.
>>>>> 
>>>>> {"key":[2010,3,3],"value":5},
>>>>> {"key":[2010,3,4],"value":18},
>>>>> {"key":[2010,3,5],"value":77},
>>>>> {"key":[2010,3,6],"value":155},
>>>>> {"key":[2010,3,7],"value":4}
>>>>> 
>>>>> How can I count the unique usernames emitted per day?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Mar 7, 2010 at 10:02 PM, Paweł Stawicki <pawelstawicki@gmail.com> wrote:
>>>>>> Just emit all documents with em_type = 0 in map function, with [date,
>>>>>> em_uname] as key. Then count in reduce.
>>>>>> 
>>>>>> Map:
>>>>>> function(doc) {
>>>>>>  if (doc.em_type == 0) {
>>>>>>    // If you only want to count, you can emit anything (e.g. 1) instead of doc here.
>>>>>>    emit([date, doc.em_uname], doc);
>>>>>>  }
>>>>>> }
>>>>>> 
>>>>>> Reduce:
>>>>>> function(keys, values, rereduce) {
>>>>>>  if (!rereduce) {
>>>>>>    return values.length; // count of values
>>>>>>  } else {
>>>>>>    return sum(values); // sum of the partial counts
>>>>>>  }
>>>>>> 
>>>>>>  // If you emit 1 from map instead of doc, then count_of_values == sum_of_values
>>>>>> }
>>>>>> 
>>>>>> Then you can handle everything by grouping:
>>>>>> http://yourserver:5984/yourdb/_view/yourview?group_level=2
>>>>>> or group=true
>>>>>> 
>>>>>> Regards
>>>>>> --
>>>>>> Paweł Stawicki
>>>>>> http://pawelstawicki.blogspot.com
>>>>>> http://szczecin.jug.pl
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sat, Mar 6, 2010 at 16:26, Gregory Tappero <coulix@gmail.com> wrote:
>>>>>> 
>>>>>>> Hello everyone,
>>>>>>> 
>>>>>>> I have documents of the following EdoPing type:
>>>>>>> 
>>>>>>> {
>>>>>>>   "_id": "22add509c1e7bc286832edc5bfe99ce5",
>>>>>>>   "_rev": "1-49663ab8778f445e481143120d0d7086",
>>>>>>>   "doc_type": "EdoPing",
>>>>>>>   "em_uname": "student1",
>>>>>>>   "em_gid": 1,
>>>>>>>   "created_at": "2010-03-03T14:18:19Z",
>>>>>>>   "em_ip": "92.154.70.148",
>>>>>>>   "em_type": 0,
>>>>>>>   "room_url": "z2fudcvcrfa3reaydatre",
>>>>>>>   "room_users": [
>>>>>>>       "tutorsbox"
>>>>>>>   ]
>>>>>>> }
>>>>>>> 
>>>>>>> I would like to count all unique em_uname values of em_type 0 on a given
>>>>>>> date.
>>>>>>> 
>>>>>>> For now i used this map/reduce
>>>>>>> http://friendpaste.com/5xUUQ26bbl9d5KRB8eojwe
>>>>>>> 
>>>>>>> Date.prototype.setRFC3339 = function(dString){
>>>>>>>    var regexp =
>>>>>>>        /(\d\d\d\d)(-)?(\d\d)(-)?(\d\d)(T)?(\d\d)(:)?(\d\d)(:)?(\d\d)(\.\d+)?(Z|([+-])(\d\d)(:)?(\d\d))/;
>>>>>>> 
>>>>>>>    if (dString.toString().match(new RegExp(regexp))) {
>>>>>>>        var d = dString.match(new RegExp(regexp));
>>>>>>>        var offset = 0;
>>>>>>> 
>>>>>>>        this.setUTCDate(1);
>>>>>>>        this.setUTCFullYear(parseInt(d[1],10));
>>>>>>>        this.setUTCMonth(parseInt(d[3],10) - 1);
>>>>>>>        this.setUTCDate(parseInt(d[5],10));
>>>>>>>        this.setUTCHours(parseInt(d[7],10));
>>>>>>>        this.setUTCMinutes(parseInt(d[9],10));
>>>>>>>        this.setUTCSeconds(parseInt(d[11],10));
>>>>>>>        if (d[12])
>>>>>>>            this.setUTCMilliseconds(parseFloat(d[12]) * 1000);
>>>>>>>        else
>>>>>>>            this.setUTCMilliseconds(0);
>>>>>>>        if (d[13] != 'Z') {
>>>>>>>            offset = (d[15] * 60) + parseInt(d[17],10);
>>>>>>>            offset *= ((d[14] == '-') ? -1 : 1);
>>>>>>>            this.setTime(this.getTime() - offset * 60 * 1000);
>>>>>>>        }
>>>>>>>    } else {
>>>>>>>        this.setTime(Date.parse(dString));
>>>>>>>    }
>>>>>>>    return this;
>>>>>>> };
>>>>>>> 
>>>>>>> var seenKeys = new Array();
>>>>>>> 
>>>>>>> function(doc) {
>>>>>>> 
>>>>>>> 
>>>>>>>    if (doc.doc_type=="EdoPing" && doc.em_type==0) {
>>>>>>>        date = new Date().setRFC3339(doc.created_at);
>>>>>>>        var key = doc.em_uname +
>>>>>>>            String(doc.created_at).substring(0,10);
>>>>>>>        if (seenKeys[key] ==  undefined  ) {
>>>>>>>            seenKeys[key] = 1;
>>>>>>>            emit([date.getFullYear(), parseInt(date.getMonth())+1,
>>>>>>> date.getDate() ] , 1);
>>>>>>>         }
>>>>>>>    }
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> It works when saved for the first time, but as soon as new EdoPings
>>>>>>> get added it starts emitting rows it has already seen (same key),
>>>>>>> creating faulty count results.
>>>>>>> 
>>>>>>> Is it ok to have seenKeys outside of the doc function() ?
>>>>>>> What other way could i use to get the same results ?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Greg
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Greg Tappero
>>>>> CTO co founder Edoboard
>>>>> http://www.edoboard.com
>>>>> +33 0645764425
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> 

