couchdb-user mailing list archives

From Matthieu Rakotojaona <matthieu.rakotoja...@gmail.com>
Subject Re: Doubt on map/reduce and "joins" by id
Date Mon, 15 Oct 2012 17:58:35 GMT
Hello,

The wiki has a page about reduce functions; see
http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Reduce_Functions
Here are a few more notes on how to use them. You may already be
familiar with some of this, but I thought it could be useful to
passers-by too:

1/ reduce takes 3 arguments: keys (an array), values (an array) and
rereduce (a boolean)

2/ reduce will be called many times over the life of a database, and
those calls fall into 2 groups:
  1/ "First" it is called on map output. In this case:
     * rereduce = false
     * keys is an array of [key, id] pairs, where key is the key you
emitted in the map and id is the id of the doc
     * values is an array of the values you emitted in the map.
  keys and values are correlated by index: keys[7] and values[7] hold
related information (i.e. they come from the same emitted row)
  2/ "Then" it is called on reduce output. In this case:
     * rereduce = true
     * keys = null
     * values is an array of the values returned by previous calls to
the reduce function

  This second case has a fundamental consequence: a reduce output
_must_ be usable as another reduce input.
  "First" and "Then" are misleading, because you never know _when_
reduce will be called. But you know _how_, and that is enough.
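
To make the two kinds of calls concrete, here is a minimal sketch of a
reduce function that counts emitted rows. The name countRows and the
sample keys and values are made up for illustration; the calls at the
bottom only imitate, by hand, the way CouchDB might split the work.

```javascript
// Hypothetical reduce function that counts emitted rows.
// First pass (rereduce === false): `keys` is an array of [key, docId]
// pairs and `values` holds the emitted values at the same indexes.
// Rereduce pass (rereduce === true): `keys` is null and `values` holds
// the results of earlier reduce calls.
function countRows(keys, values, rereduce) {
  if (rereduce) {
    // values are partial counts from earlier calls: add them up
    return values.reduce(function (total, partial) {
      return total + partial;
    }, 0);
  }
  // values are raw map outputs: each one is a single row
  return values.length;
}

// Hand-made simulation: two first-pass calls, then one rereduce call.
var partialA = countRows(
  [["feed-1", "doc-a"], ["feed-1", "doc-b"]], ["x", "y"], false);
var partialB = countRows([["feed-1", "doc-c"]], ["z"], false);
var total = countRows(null, [partialA, partialB], true);
```

Note that the two branches read `values` differently: on the first pass
each value is one row, on a rereduce each value is already a count.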

3/ Branching on whether rereduce is true or false can be a pain,
because you usually have to think about 2 different data structures
(one is the map output, the other is the reduce output).
There might be views where you don't care much about the map-only
output and only want the map+reduce output: in that case, try to emit
structures from your map that are already usable as reduce input.
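
As a rough sketch of that idea, the map below emits values that already
have the shape the reduce returns, so the reduce never needs to look at
rereduce. The doc fields (sensor, reading), the function names and the
local emit() stub are assumptions made up for this example; inside
CouchDB, emit() is provided by the view server.

```javascript
// Stand-in for CouchDB's emit(), so the sketch runs outside the server.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Hypothetical map: each emitted value already looks like a reduce output.
function mapDoc(doc) {
  emit(doc.sensor, { min: doc.reading, max: doc.reading, count: 1 });
}

// The reduce merges such structures and never inspects `rereduce`.
// It assumes `values` is never empty.
function mergeStats(keys, values, rereduce) {
  return values.reduce(function (acc, v) {
    return {
      min: Math.min(acc.min, v.min),
      max: Math.max(acc.max, v.max),
      count: acc.count + v.count
    };
  });
}

var docs = [
  { sensor: "a", reading: 4 },
  { sensor: "a", reading: 9 },
  { sensor: "a", reading: 1 }
];
docs.forEach(function (doc) { mapDoc(doc); });

var values = rows.map(function (r) { return r.value; });
// One first-pass call on two rows, then a rereduce-style call that mixes
// the partial result with the remaining map output.
var first = mergeStats(null, values.slice(0, 2), false);
var stats = mergeStats(null, [first, values[2]], true);
```

Because map output and reduce output share one shape, it does not matter
how the rows end up being batched.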

I had a similar case recently; what I did can be found at
https://github.com/rakoo/pfeed/tree/master/pfeed-couch/views/feeds-stats.
Basically:

* I have 2 types of docs in my db: "feed" and "entry". A "feed" has
a unique id and a unique title; an "entry" has a feed_id and a
state (isRead or isUnread). One feed can have any number of entries.
* I wanted to count the entries in each state for a feed
* I settled on this kind of structure as the target output:
{
  "title": <title>,
  "isUnread": <some number>,
  "isRead": <some other number>
}
of which I would have one for each feed id.
* Obviously, no single doc holds all of this, so a map cannot emit the
complete structure for each doc. Instead, I emit the same structure
with only the fields I know:
For a feed:
  {
    "title": <title>
  }
For an entry:
  {
    "isUnread": <1 if it is unread or...>,
    "isRead": <... 1 if it is read>
  }
both with the feed id as the key, so I can query the view with reduce
and exact grouping (group=true) and get one structure for each id
* That was the "map" part. The "reduce" part now only has to take
all those structures and merge them:
    * "merge" the titles, which should all be the same (remember: I
query the view grouped, so within one group the feed id is the same,
and thus so is the title)
    * "merge" the state counts by just adding them
* BUT this view has to be grouped _exactly_, which means I get one
stats structure for each id. If I don't group exactly, the output is
nonsense, since titles from different feeds are merged with no way of
knowing which one will win.
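
The steps above could look roughly like the sketch below. It is not the
code from the linked repository: the doc fields type, title, feed_id and
state, the function names and the local emit() stub are assumptions, and
the grouping loop at the end only imitates by hand what a group=true
query does.

```javascript
// Stand-in for CouchDB's emit(), so the sketch runs outside the server.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Hypothetical map: emit a partial stats structure keyed by feed id.
function mapDoc(doc) {
  if (doc.type === "feed") {
    emit(doc._id, { title: doc.title });
  } else if (doc.type === "entry") {
    var partial = {};
    partial[doc.state] = 1; // state is "isRead" or "isUnread"
    emit(doc.feed_id, partial);
  }
}

// Hypothetical reduce: merge partial structures. Map output and reduce
// output share one shape, so `rereduce` is ignored.
function mergeFeedStats(keys, values, rereduce) {
  var out = { isUnread: 0, isRead: 0 };
  values.forEach(function (v) {
    if (v.title !== undefined) { out.title = v.title; }
    out.isUnread += v.isUnread || 0;
    out.isRead += v.isRead || 0;
  });
  return out;
}

var docs = [
  { _id: "f1", type: "feed", title: "News" },
  { _id: "e1", type: "entry", feed_id: "f1", state: "isUnread" },
  { _id: "e2", type: "entry", feed_id: "f1", state: "isRead" },
  { _id: "e3", type: "entry", feed_id: "f1", state: "isUnread" },
  { _id: "f2", type: "feed", title: "Blog" }
];
docs.forEach(function (doc) { mapDoc(doc); });

// Imitate exact grouping: collect the values per key, reduce each group.
var byFeed = {};
rows.forEach(function (r) {
  (byFeed[r.key] = byFeed[r.key] || []).push(r.value);
});
var stats = {};
Object.keys(byFeed).forEach(function (id) {
  stats[id] = mergeFeedStats(null, byFeed[id], false);
});
```

Reducing all rows in a single group instead would add the counts of
both feeds together and keep whichever title happened to come last,
which is the nonsense output mentioned above.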

In conclusion, I would say that CouchDB's map/reduce views call for a
total rethinking of how you organize your data. I started with this
layout because it is how I would naturally have made my docs, but it
means I have to do some kind of join.
Depending on your application's needs, you might want to put the
relational model aside and arrange your data differently, so that
retrieving the data your application wants (which is, in the end, all
that matters) is easier, even if the data structure is harder to
understand at first sight (which matters only at prototyping/debug
time, and that should be negligible compared to usage time).

-- 
Matthieu RAKOTOJAONA
