incubator-couchdb-user mailing list archives

From Senthilkumar Peelikkampatti <senthilkumar.peelikkampa...@gmail.com>
Subject Re: two view questions: group=true, inverted indices
Date Wed, 17 Feb 2010 02:30:29 GMT
CouchDB needs native full-text indexing (FTI), and Erlang has few options
available in that space. I am aware of one experiment in progress at
http://github.com/bdionne/indexer. I think the CouchDB committers should
support and encourage this kind of initiative.

On Mon, Feb 8, 2010 at 5:31 AM, Robert Dionne
<dionne@dionne-associates.com> wrote:
>
>
>
> On Feb 7, 2010, at 6:29 PM, Paul Davis wrote:
>
>> On Sun, Feb 7, 2010 at 6:15 PM, Harold Cooper <harold@mit.edu> wrote:
>>> Hi there,
>>>
>>> I'm new to CouchDB and have two questions about the use of mapreduce
>>> in views.
>>>
>>> 1.
>>> As far as I can tell, even when I pass group=true to a view,
>>> reduce(keys, values) is still passed different keys,
>>> e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
>>> ["b", "94d13f9e969786c6d653555a7e94f61e"]].
>>>
>>
>> Even when you query with group=true, the ungrouped reduction is still
>> calculated. Generally you should be able to just ignore such things.
>>
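To illustrate the point above, here is a minimal sketch in plain JavaScript, run outside CouchDB, of a reduce function that works no matter how many distinct keys a single call covers. The doc ids and the way the rows are split into two calls are made up for illustration; the third `rereduce` argument is part of CouchDB's reduce signature.

```javascript
// A count-style reduce that never inspects `keys`, so it returns the same
// answer whether one call covers a single key or several.
function reduce(keys, values, rereduce) {
  if (rereduce) {
    // values are partial counts returned by earlier reduce calls
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
  // values are the raw values emitted by map, one per row
  return values.length;
}

// First pass: keys arrive as [emittedKey, docId] pairs and may mix keys.
// (Hypothetical doc ids and chunking.)
var partial1 = reduce([["a", "doc1"], ["b", "doc2"]], [1, 1], false);
var partial2 = reduce([["b", "doc3"]], [1], false);

// Second pass: keys is null and values holds the partial results.
var total = reduce(null, [partial1, partial2], true);
console.log(total); // 3
```

Because the function only looks at `values`, the mixed keys seen in the ungrouped calls do no harm, and the grouped output is still one row per key.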
>>> Isn't the whole point of group=true that this shouldn't happen?
>>>
>>>
>>> 2.
>>> When I've read about mapreduce before, a classic example use is
>>> constructing an inverted index. But if I make a view like:
>>> {
>>> map: "function(doc) {
>>>  var words = doc.text.split(' ');
>>>  for (var i in words) {
>>>    emit(words[i], [doc._id]);
>>>  }
>>> }",
>>> reduce: "function(keys, values) {
>>>  // concatenate the lists of docIds together:
>>>  return Array.prototype.concat.apply([], values);
>>> }"
>>> }
>>> then couchdb complains that the reduce result is growing too fast.
>>>
>>> I did read that this is the way things are, but it's too bad because
>>> it would be a perfect application of mapreduce, and the only other
>>> text search option I've heard of is couchdb-lucene which doesn't
>>> sound nearly as fun/elegant.
>>>
>>> Is there another way to approach this?
>>> Should I just not reduce and end up with one row per word-occurrence?
>>
>> CouchDB Map/Reduce isn't like Google Map/Reduce. It's much more like
>> the old school map/reduce pattern that expects to be calculating a
>> single reduction value. The CouchDB internals make doing things like
>> inverted indices hard. The 'proper' way would be to do as you say and
>> return a single row per key with only some intermediary values handled
>> by reductions.
>>
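The map-only approach described above can be sketched in plain JavaScript. This is only a simulation under stated assumptions: the `emit` stand-in, the `lookup` helper, and the sample documents are hypothetical, and only the `map` function would actually go into a design document. It relies on the fact that every view row already carries the id of the document that emitted it, so the value can be null.

```javascript
// Stand-in for CouchDB's emit(); collects rows the way a view would.
var rows = [];
var currentId = null;
function emit(key, value) {
  rows.push({ key: key, id: currentId, value: value });
}

// Map function in the shape CouchDB expects: one row per word occurrence.
function map(doc) {
  if (typeof doc.text !== "string") return; // skip docs without text
  var words = doc.text.split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    if (words[i]) emit(words[i].toLowerCase(), null);
  }
}

// Hypothetical sample documents.
var docs = [
  { _id: "doc1", text: "the quick fox" },
  { _id: "doc2", text: "the lazy dog" }
];
docs.forEach(function (doc) { currentId = doc._id; map(doc); });

// Rough equivalent of querying the view with ?key="the": the ids of the
// matching rows form the posting list for that word.
function lookup(word) {
  return rows
    .filter(function (r) { return r.key === word; })
    .map(function (r) { return r.id; });
}

console.log(lookup("the")); // [ 'doc1', 'doc2' ]
```

With no reduce, or with a reduce that only returns a small value such as a count, nothing grows with the number of documents, so the "reduce result is growing too fast" complaint should not arise.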
>> Also, while couchdb-lucene may not present nearly as much fun, it's got
>> quite a bit to it. Full-Text indexing is hard. Many examples show it
>> as nothing more than an inverted index, but that's hiding 95% of the
>> knowledge on information retrieval and scoring algorithms that are in
>> Lucene. And there's the integration with Tika to do things like
>> attachment indexing. I quite dislike Java but I've come to accept that
>> there really isn't much competition that's compatible with CouchDB's
>> document model.
>>
>
> I think it does have challenges and couchdb-lucene offers a good solution for most use
> cases, plus it's mature and well known, but at some point, perhaps post 1.0 I think a
> native FTI implementation will add a lot of value to CouchDB if only by removing the
> dependency on Java.
>
>
>
>
>
>
>> HTH,
>> Paul Davis
>>
>>> Thanks for any help,
>>> and sorry if this has been covered before, I did try to search around first.
>>> --
>>> Harold
>>>
>
>



-- 
Regards,
Senthilkumar Peelikkampatti,
http://pmsenthilkumar.blogspot.com/
