incubator-couchdb-user mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: two view questions: group=true, inverted indices
Date Sun, 07 Feb 2010 23:29:38 GMT
On Sun, Feb 7, 2010 at 6:15 PM, Harold Cooper <harold@mit.edu> wrote:
> Hi there,
>
> I'm new to CouchDB and have two questions about the use of mapreduce
> in views.
>
> 1.
> As far as I can tell, even when I pass group=true to a view,
> reduce(keys, values) is still passed different keys,
> e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
> ["b", "94d13f9e969786c6d653555a7e94f61e"]].
>

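Even when you query with group=true, the ungrouped reduction is still
calculated, so your reduce function will sometimes be handed rows with
different keys. Generally you should be able to just ignore this, as
long as the reduce doesn't assume that all of its keys match.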
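
As a rough sketch of what "ignoring it" can look like, here is a reduce
that counts rows and never looks at the keys, so it does not matter
whether it is handed rows for one key or for several. The name
countReduce and the sample rows are made up for illustration; the
(keys, values, rereduce) signature is the one CouchDB passes to reduce
functions.

```javascript
// Hypothetical reduce: counts rows without inspecting `keys`.
function countReduce(keys, values, rereduce) {
  if (rereduce) {
    // On rereduce, `values` holds partial counts from earlier reduce calls
    // (and `keys` is null), so the partial counts are summed.
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
  // On a first-pass reduce, `values` holds one entry per emitted row.
  return values.length;
}

// Simulated calls with made-up rows; each key is a [key, docId] pair.
var sameKey = countReduce([["a", "id1"], ["a", "id2"]], [1, 1], false);
var mixedKeys = countReduce(
  [["a", "id1"], ["b", "id2"], ["b", "id3"]], [1, 1, 1], false);
var combined = countReduce(null, [sameKey, mixedKeys], true);
```

Because the result depends only on how many rows came in, the mixed-key
calls are harmless.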

> Isn't the whole point of group=true that this shouldn't happen?
>
>
> 2.
> When I've read about mapreduce before, a classic example use is
> constructing an inverted index. But if I make a view like:
> {
> map: "function(doc) {
>  var words = doc.text.split(' ');
>  for (var i in words) {
>    emit(words[i], [doc._id]);
>  }
> }",
> reduce: "function(keys, values) {
>  // concatenate the lists of docIds together:
>  return Array.prototype.concat.apply([], values);
> }"
> }
> then couchdb complains that the reduce result is growing too fast.
>
> I did read that this is the way things are, but it's too bad because
> it would be a perfect application of mapreduce, and the only other
> text search option I've heard of is couchdb-lucene which doesn't
> sound nearly as fun/elegant.
>
> Is there another way to approach this?
> Should I just not reduce and end up with one row per word-occurrence?

CouchDB Map/Reduce isn't like Google Map/Reduce. It's much more like
the old-school map/reduce pattern that expects to calculate a single
reduction value. The CouchDB internals make things like inverted
indices hard to build in a reduce. The 'proper' way would be to do as
you say and return a single row per emitted key, with reductions
handling only small intermediary values.
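
A minimal sketch of that map-only approach, assuming documents with a
string "text" field as in your example. The emit() stub, the sample
documents and the lookup() helper are stand-ins so the snippet runs
outside CouchDB; in a real view only the map function is needed, and
looking up a word is a view query with key="word". Skipping repeated
words within one document is my own choice here, not something CouchDB
requires.

```javascript
// Stand-in for CouchDB's emit(); the real one is supplied by the view server.
var rows = [];
var currentId = null;
function emit(key, value) {
  rows.push({ id: currentId, key: key, value: value });
}

// Map function: one row per distinct word in each document, no reduce.
function mapWords(doc) {
  if (typeof doc.text !== "string") { return; }
  var seen = {};
  var words = doc.text.split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    var w = words[i];
    if (w && !Object.prototype.hasOwnProperty.call(seen, w)) {
      seen[w] = true;
      // No value is needed: each view row already carries the doc id.
      emit(w, null);
    }
  }
}

// Hypothetical documents, run through the map the way a view build would.
var docs = [
  { _id: "d1", text: "apple banana apple" },
  { _id: "d2", text: "banana cherry" }
];
docs.forEach(function (doc) { currentId = doc._id; mapWords(doc); });

// Rough equivalent of querying the view for a single key.
function lookup(word) {
  return rows
    .filter(function (r) { return r.key === word; })
    .map(function (r) { return r.id; });
}
```

The list of documents for a word is then just the set of rows returned
for that key, so nothing has to accumulate inside a reduce value.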

Also, while couchdb-lucene may not be nearly as much fun, it's got
quite a bit to it. Full-text indexing is hard. Many examples show it
as nothing more than an inverted index, but that hides 95% of the
knowledge of information retrieval and scoring algorithms that is in
Lucene. And there's the integration with Tika to do things like
attachment indexing. I quite dislike Java, but I've come to accept that
there really isn't much competition that's compatible with CouchDB's
document model.

HTH,
Paul Davis

> Thanks for any help,
> and sorry if this has been covered before, I did try to search around first.
> --
> Harold
>
