couchdb-user mailing list archives

From Robert Dionne <dio...@dionne-associates.com>
Subject Re: two view questions: group=true, inverted indices
Date Mon, 08 Feb 2010 11:31:04 GMT



On Feb 7, 2010, at 6:29 PM, Paul Davis wrote:

> On Sun, Feb 7, 2010 at 6:15 PM, Harold Cooper <harold@mit.edu> wrote:
>> Hi there,
>> 
>> I'm new to CouchDB and have two questions about the use of mapreduce
>> in views.
>> 
>> 1.
>> As far as I can tell, even when I pass group=true to a view,
>> reduce(keys, values) is still passed different keys,
>> e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
>> ["b", "94d13f9e969786c6d653555a7e94f61e"]].
>> 
> 
> Even when you query with group=true, the ungrouped reduction is still
> calculated. Generally you should be able to just ignore such things.
> 
>> Isn't the whole point of group=true that this shouldn't happen?
>> 
>> 
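To illustrate the point about ignoring the keys: a minimal sketch of a reduce function whose result does not depend on which keys it is handed. It uses the third `rereduce` argument that CouchDB passes to reduce functions; the name `countReduce` and the sample rows are made up for illustration, and the calls at the bottom only simulate what the view engine might do.

```javascript
// Count reduce that never inspects `keys`, so it gives a sensible answer
// whether it is handed rows for one key or for several different keys.
function countReduce(keys, values, rereduce) {
  if (rereduce) {
    // On rereduce, `keys` is null and `values` holds earlier reduce results.
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
  // On the first pass, `values` holds the emitted map values, one per row.
  return values.length;
}

// Local simulation only (hypothetical rows, not output from a real database).
var first = countReduce([["a", "id1"], ["b", "id2"]], [1, 1], false);
var second = countReduce([["b", "id3"]], [1], false);
var total = countReduce(null, [first, second], true);
console.log(total); // 3
```

Because the function only counts values and sums partial counts, it does not matter that the first call above mixes rows for "a" and "b".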
>> 2.
>> When I've read about mapreduce before, a classic example use is
>> constructing an inverted index. But if I make a view like:
>> {
>> map: "function(doc) {
>>  var words = doc.text.split(' ');
>>  for (var i in words) {
>>    emit(words[i], [doc._id]);
>>  }
>> }",
>> reduce: "function(keys, values) {
>>  // concatenate the lists of docIds together:
>>  return Array.prototype.concat.apply([], values);
>> }"
>> }
>> then couchdb complains that the reduce result is growing too fast.
>> 
>> I did read that this is the way things are, but it's too bad because
>> it would be a perfect application of mapreduce, and the only other
>> text search option I've heard of is couchdb-lucene which doesn't
>> sound nearly as fun/elegant.
>> 
>> Is there another way to approach this?
>> Should I just not reduce and end up with one row per word-occurrence?
> 
> CouchDB Map/Reduce isn't like Google Map/Reduce. It's much more like
> the old school map/reduce pattern that expects to be calculating a
> single reduction value. The CouchDB internals make doing things like
> inverted indices hard. The 'proper' way would be to do as you say and
> return a single row per key with only some intermediary values handled
> by reductions.
> 
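A rough sketch of the "one row per word occurrence" approach with no reduce at all. Only `map` would go into the design document; `emit` is provided by CouchDB there, and everything below the marker comment (`rows`, `indexDocs`, `lookup`, the sample documents) is a hypothetical local stand-in so the idea can be tried outside the database.

```javascript
// Map function as it might appear in a design document (no reduce).
// Each row already carries the document id, so the value can be null.
function map(doc) {
  if (typeof doc.text !== "string") return;
  var words = doc.text.split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    if (words[i]) emit(words[i], null);
  }
}

// --- local stand-in for the view engine, for illustration only ---
var rows = [];
var currentId = null;

function emit(key, value) {
  rows.push({ id: currentId, key: key, value: value });
}

function indexDocs(docs) {
  docs.forEach(function (doc) {
    currentId = doc._id;
    map(doc);
  });
  // View rows come back ordered by key; approximate that with a simple sort.
  rows.sort(function (a, b) {
    if (a.key !== b.key) return a.key < b.key ? -1 : 1;
    if (a.id !== b.id) return a.id < b.id ? -1 : 1;
    return 0;
  });
}

// Roughly what querying the view with a single key would give back.
function lookup(word) {
  return rows
    .filter(function (r) { return r.key === word; })
    .map(function (r) { return r.id; });
}

indexDocs([
  { _id: "d1", text: "red fish blue fish" },
  { _id: "d2", text: "blue sky" }
]);
console.log(lookup("blue")); // [ 'd1', 'd2' ]
console.log(lookup("fish")); // [ 'd1', 'd1' ]
```

A word that occurs twice in one document yields two rows, so the client would need to de-duplicate ids if it only wants a list of matching documents.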
> Also, while couchdb-lucene may not be nearly as much fun, it's got
> quite a bit to it. Full-text indexing is hard. Many examples show it
> as nothing more than an inverted index, but that's hiding 95% of the
> knowledge on information retrieval and scoring algorithms that are in
> Lucene. And there's the integration with Tika to do things like
> attachment indexing. I quite dislike Java but I've come to accept that
> there really isn't much competition that's compatible with CouchDB's
> document model.
> 

I think full-text indexing does have its challenges, and couchdb-lucene
offers a good solution for most use cases; it's also mature and well
known. But at some point, perhaps post-1.0, I think a native FTI
implementation would add a lot of value to CouchDB, if only by removing
the dependency on Java.

> HTH,
> Paul Davis
> 
>> Thanks for any help,
>> and sorry if this has been covered before; I did try to search around first.
>> --
>> Harold
>> 

