From: Paul Davis
Date: Sun, 7 Feb 2010 18:29:38 -0500
Subject: Re: two view questions: group=true, inverted indices
To: user@couchdb.apache.org, hrldcpr@gmail.com

On Sun, Feb 7, 2010 at 6:15 PM, Harold Cooper wrote:
> Hi there,
>
> I'm new to CouchDB and have two questions about the use of mapreduce
> in views.
>
> 1.
> As far as I can tell, even when I pass group=true to a view,
> reduce(keys, values) is still passed different keys,
> e.g. keys = [["a", "551a50e574ccd439af28428db2401ab4"],
> ["b", "94d13f9e969786c6d653555a7e94f61e"]].
>

Even when you query with group=true, the ungrouped reduction is still
calculated. Generally you should be able to just ignore such things.

> Isn't the whole point of group=true that this shouldn't happen?
>
>
> 2.
> When I've read about mapreduce before, a classic example use is
> constructing an inverted index. But if I make a view like:
> {
> map: "function(doc) {
>   var words = doc.text.split(' ');
>   for (var i in words) {
>     emit(words[i], [doc._id]);
>   }
> }",
> reduce: "function(keys, values) {
>   // concatenate the lists of docIds together:
>   return Array.prototype.concat.apply([], values);
> }"
> }
> then couchdb complains that the reduce result is growing too fast.
>
> I did read that this is the way things are, but it's too bad because
> it would be a perfect application of mapreduce, and the only other
> text search option I've heard of is couchdb-lucene which doesn't
> sound nearly as fun/elegant.
>
> Is there another way to approach this?
> Should I just not reduce and end up with one row per word-occurrence?

CouchDB Map/Reduce isn't like Google Map/Reduce. It's much more like
the old school map/reduce pattern that expects to be calculating a
single reduction value. The CouchDB internals make doing things like
inverted indices hard. The 'proper' way would be to do as you say and
return a single row per key with only some intermediary values handled
by reductions.

Also, while couchdb-lucene may not present nearly as much fun, it's got
quite a bit to it. Full-text indexing is hard. Many examples show it as
nothing more than an inverted index, but that's hiding 95% of the
knowledge on information retrieval and scoring algorithms that are in
Lucene. And there's the integration with Tika to do things like
attachment indexing. I quite dislike Java but I've come to accept that
there really isn't much competition that's compatible with CouchDB's
document model.

HTH,
Paul Davis

> Thanks for any help,
> and sorry if this has been covered before, I did try to search around first.
> --
> Harold
>
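
As a rough illustration of the map-only approach discussed in this thread (one row per word per document, with no reduce that builds up lists), here is a minimal sketch. The `emit` stand-in, the `lookup` helper and the sample documents are hypothetical and exist only so the snippet runs on its own under Node.js; inside CouchDB only the `map` and `reduce` functions would go into the design document. The counting reduce is not from the original post; it is just one example of a reduction that collapses to a single small value.

```javascript
// Hypothetical stand-in for CouchDB's view engine, for illustration only.
const rows = [];
let currentDocId = null;
function emit(key, value) {
  // CouchDB attaches the emitting document's id to every view row.
  rows.push({ key: key, id: currentDocId, value: value });
}

// Map function in the shape CouchDB expects: one row per distinct word per doc.
function map(doc) {
  if (typeof doc.text !== "string") return;
  const seen = {};
  const words = doc.text.toLowerCase().split(/\s+/);
  for (let i = 0; i < words.length; i++) {
    const word = words[i];
    if (word && !Object.prototype.hasOwnProperty.call(seen, word)) {
      seen[word] = true;
      emit(word, null); // no need to emit doc._id; the row already carries it
    }
  }
}

// A reduce whose output stays small: the number of rows per key.
function reduce(keys, values, rereduce) {
  if (rereduce) {
    // values are partial counts from earlier reduce calls
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
  return values.length;
}

// Made-up sample documents, run through the map function.
const docs = [
  { _id: "a", text: "the quick brown fox" },
  { _id: "b", text: "the lazy dog" }
];
docs.forEach(function (doc) {
  currentDocId = doc._id;
  map(doc);
});

// Views are kept sorted by key (simplified here to plain string comparison).
rows.sort(function (x, y) {
  return x.key < y.key ? -1 : x.key > y.key ? 1 : 0;
});

// Hypothetical helper mimicking a query for a single key.
function lookup(word) {
  return rows
    .filter(function (row) { return row.key === word; })
    .map(function (row) { return row.id; });
}

console.log(lookup("the"));
console.log(lookup("fox"));
```

With a view like this, the list of documents for a word comes from querying the view by key (adding reduce=false if a reduce function is defined) and reading the id of each returned row, so the reduce never has to accumulate the id lists itself.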