From: Paul Davis <paul.joseph.davis@gmail.com>
Date: Sun, 7 Feb 2010 18:29:38 -0500
Message-ID: <e2111bbb1002071529t6aaa8e23p908ef7af72d0075e@mail.gmail.com>
Subject: Re: two view questions: group=true, inverted indices
To: user@couchdb.apache.org, hrldcpr@gmail.com

On Sun, Feb 7, 2010 at 6:15 PM, Harold Cooper <harold@mit.edu> wrote:
> Hi there,
>
> I'm new to CouchDB and have two questions about the use of mapreduce
> in views.
>
> 1.
> As far as I can tell, even when I pass group=true to a view,
> reduce(keys, values) is still passed different keys,
> e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
> ["b", "94d13f9e969786c6d653555a7e94f61e"]].
>

Even when you query with group=true, the ungrouped reduction is still
calculated, so your reduce function can be called with a mix of keys.
Generally you should be able to just ignore that.
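
A minimal sketch of why that is usually harmless, written as plain
JavaScript rather than CouchDB internals. The sample rows and the
countPerKey helper are made up for illustration; only the
reduce(keys, values, rereduce) signature comes from CouchDB. Because
the reduce never looks at keys, it gives a sensible answer whether it
is called for one key or for a mix of keys:

```javascript
// Count reduce in CouchDB's (keys, values, rereduce) shape.
// It never inspects keys, so mixed-key calls are harmless.
function reduce(keys, values, rereduce) {
  if (rereduce) {
    // values holds earlier reduce outputs (partial counts)
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
  // values holds emitted values, one per mapped row
  return values.length;
}

// Hypothetical mapped rows; on a first pass keys arrive as [key, docId] pairs
var rows = [
  { key: ["a", "doc1"], value: 1 },
  { key: ["a", "doc2"], value: 1 },
  { key: ["b", "doc3"], value: 1 }
];

// Hypothetical helper: a toy stand-in for what group=true returns
function countPerKey(rows) {
  var groups = {};
  rows.forEach(function (row) {
    var k = row.key[0];
    (groups[k] = groups[k] || []).push(row);
  });
  var result = {};
  Object.keys(groups).forEach(function (k) {
    var g = groups[k];
    result[k] = reduce(
      g.map(function (r) { return r.key; }),
      g.map(function (r) { return r.value; }),
      false
    );
  });
  return result;
}

var grouped = countPerKey(rows);

// One call over all rows, with mixed keys: the ungrouped reduction
var ungrouped = reduce(
  rows.map(function (r) { return r.key; }),
  rows.map(function (r) { return r.value; }),
  false
);

// Combining the per-key results through rereduce agrees with it
var combined = reduce(null, [grouped.a, grouped.b], true);
```

A reduce that does depend on seeing a single key per call is the kind
that gets into trouble here.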

> Isn't the whole point of group=true that this shouldn't happen?
>
>
> 2.
> When I've read about mapreduce before, a classic example use is
> constructing an inverted index. But if I make a view like:
> {
> map: "function(doc) {
>  var words = doc.text.split(' ');
>  for (var i in words) {
>    emit(words[i], [doc._id]);
>  }
> }",
> reduce: "function(keys, values) {
>  // concatenate the lists of docIds together:
>  return Array.prototype.concat.apply([], values);
> }"
> }
> then couchdb complains that the reduce result is growing too fast.
>
> I did read that this is the way things are, but it's too bad because
> it would be a perfect application of mapreduce, and the only other
> text search option I've heard of is couchdb-lucene which doesn't
> sound nearly as fun/elegant.
>
> Is there another way to approach this?
> Should I just not reduce and end up with one row per word-occurrence?

CouchDB Map/Reduce isn't like Google Map/Reduce. It's much more like
the old-school map/reduce pattern that expects to be calculating a
single reduction value. The CouchDB internals make doing things like
inverted indices hard. The 'proper' way would be to do as you say and
return a single row per key with only some intermediary values handled
by reductions.
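
A rough sketch of the map-only approach, assuming a doc.text string
field as in your example. The emit stub, the sample docs and the lookup
helper are hypothetical, there only so the snippet runs outside
CouchDB; inside a view only the map function would be used. Skipping
repeated words within a document is my own choice here, not something
CouchDB requires:

```javascript
// Map-only inverted index: one row per (word, document) pair, no reduce.
function map(doc) {
  if (typeof doc.text !== "string") return;
  var seen = {};
  doc.text.split(/\s+/).forEach(function (word) {
    if (word && !Object.prototype.hasOwnProperty.call(seen, word)) {
      seen[word] = true;
      emit(word, null); // the row's id field already carries doc._id
    }
  });
}

// --- Local stand-ins for the view server (hypothetical, for testing) ---
var rows = [];
var currentId = null;
function emit(key, value) {
  rows.push({ id: currentId, key: key, value: value });
}

var docs = [
  { _id: "d1", text: "couch views are fun" },
  { _id: "d2", text: "views views everywhere" }
];
docs.forEach(function (doc) {
  currentId = doc._id;
  map(doc);
});

// Toy stand-in for querying the view with key="<word>"
function lookup(word) {
  return rows
    .filter(function (r) { return r.key === word; })
    .map(function (r) { return r.id; });
}
```

Querying the real view with key="views" would return one row per
matching document, with the document id in each row's id field, so
nothing has to be concatenated in a reduce.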

Also, while couchdb-lucene may not be nearly as much fun, it's got
quite a bit to it. Full-text indexing is hard. Many examples show it
as nothing more than an inverted index, but that's hiding 95% of the
knowledge on information retrieval and scoring algorithms that are in
Lucene. And there's the integration with Tika to do things like
attachment indexing. I quite dislike Java, but I've come to accept that
there really isn't much competition that's compatible with CouchDB's
document model.

HTH,
Paul Davis

> Thanks for any help,
> and sorry if this has been covered before, I did try to search around
> first.
> --
> Harold
>