From: Chris Anderson
To: dev@couchdb.apache.org
Cc: Adam Kocoloski, dev@couchdb.apache.org
Subject: Re: Proposal: Review DBs
Date: Sun, 26 Apr 2009 20:20:15 -0700
In-Reply-To: <6801D7CA-88A7-4D67-82C5-0E912F06DA7C@gmail.com>
References: <5FCE560A-D08C-4155-8B2B-E315EDF76037@gmail.com> <1B105501-DF81-425F-AD1A-CBDF74E6FDF6@apache.org> <6801D7CA-88A7-4D67-82C5-0E912F06DA7C@gmail.com>

Sent from my iPhone

On Apr 26, 2009, at 2:26 PM, Wout Mertens wrote:

> Hi Adam,
>
> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>
>> Hi Wout, thanks for writing this up.
>>
>> One comment about the map-only views: I think you'll find that
>> Couch has already done a good bit of the work needed to support
>> them, too. Couch maintains a btree for each design doc keyed on
>> docid that stores all the view keys emitted by the maps over each
>> document. When a document is updated and then analyzed, Couch has
>> to consult that btree, purge all the KVs associated with the old
>> version of the doc from each view, and then insert the new KVs. So
>> the tracking information correlating docids and view keys is
>> already available.
>
> See I did not know that :-) Although I should have guessed.
>
> However, in the mail before this one I argued that it doesn't make
> sense to combine or chain map-only views since you can always write
> a map function that does it in one step. Do you agree?
>
> You might also know the answer to this: is it possible to make the
> Review DB be a sort of view index on the current database? All it
> needs are JSON keys and values, no other fields.
>
>> You'd still be left with the problem of generating unique docids
>> for the documents in the Review DB, but I think that's a problem
>> that needs to be solved. The restriction to only MR views with no
>> duplicate keys across views seems too strong to me.
>
> Well, since the Review DB is a local(*) hidden database that's
> handled a bit specially, I think the easiest is to assign _id a
> sequence number and create a default view that indexes the documents
> by doc.key (for updating the value for that key). There will never
> be contention and we're only interested in the key index.

We discussed this a little at CouchHack and I argued that the simplest
solution is actually good for a few reasons.

The simple solution: provide a mechanism to copy the rows of a grouped
reduce function to a new database.

Good because it is most like Hadoop/Google style map reduce. In that
paradigm, the output of a map/reduce job is not incremental, and it is
persisted in a way that allows for multiple later reduce stages to be
run on it. It's common in Hadoop to chain many m/r stages, and to try
a few iterations of each stage while developing code.

I like this also because it provides the needed functionality without
adding any new primitives to CouchDB.

The only downside of this approach is that it is not incremental. I'm
not sure that incremental chainability has much promise, as the index
management could be a pain, especially if you have branching chains.

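To make the simple solution concrete, here is a minimal in-memory
sketch, not CouchDB code. The function names (grouped_reduce,
copy_rows_to_db) and the dict standing in for the target database are
hypothetical. It assumes map rows are (key, value) pairs with mutually
sortable keys, and it borrows the sequence-number _id idea from the
quoted text above.

```python
from itertools import groupby
from operator import itemgetter


def grouped_reduce(rows, reduce_fn):
    """Group (key, value) map rows by key and reduce each group.

    Returns one {"key": ..., "value": ...} row per distinct key,
    the same shape a grouped reduce query produces.
    """
    ordered = sorted(rows, key=itemgetter(0))
    return [
        {"key": key, "value": reduce_fn([value for _, value in group])}
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def copy_rows_to_db(rows, target_db):
    """Persist each reduced row as its own document in target_db.

    target_db is a plain dict standing in for a database (assumption);
    _id is a sequence number because only the key matters downstream.
    """
    for seq, row in enumerate(rows, start=len(target_db) + 1):
        doc_id = str(seq)
        target_db[doc_id] = {"_id": doc_id, "key": row["key"], "value": row["value"]}
    return target_db


# Stage 1: count occurrences per tag and persist the grouped rows.
map_rows = [("couch", 1), ("erlang", 1), ("couch", 1), ("json", 1), ("couch", 1)]
stage1 = grouped_reduce(map_rows, sum)
stage1_db = copy_rows_to_db(stage1, {})

# Stage 2: a later, independent pass over the persisted rows (sort by value).
by_value = sorted(((doc["value"], doc["key"]) for doc in stage1_db.values()), reverse=True)
```

Each stage reads a finished, persisted result, so a stage can be rerun
or swapped while developing, at the cost of recomputing it from scratch
when the source data changes.
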
Another upside is that by reducing to a db, you give the user power to
do things like use replication to merge multiple data sets before
applying more views.

I don't want to discourage anyone from experimenting with code, just
want to point out this simple solution which would be Very Easy to
implement.

>
> (*)local: I'm assuming that views are not replicated and need to be
> recalculated for each CouchDB node. If they are replicated somehow,
> I think it would still work but we'd have to look at it a little more.
>
>> With that said, I'd prefer to spend my time extending the view
>> engine to handle chainable MR workflows in a single shot.
>> Especially in the simple sort_by_value case it just seems like a
>> cleaner way to go about things.
>
> Yes, that seems to be the gist of all repliers and I agree :-)
>
> In a nutshell, I'm hoping that:
> * A review is a new sort of view that has an "inputs" array in its
>   definition.
> * Only MR views are allowed as inputs, no KV duplication allowed.
> * It builds a persistent index of the incoming views when those get
>   updated.
> * That index is then used to build the view index for the review
>   when the review gets updated.
> * I think I covered the most important algorithms needed to
>   implement this in my original proposal.
>
> Does this sound feasible? If so I'll update my proposal accordingly.
>
> Wout.
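
For comparison, the "review" outlined in the quoted list can be
sketched the same way. This is only an illustration of the bullet
points, not the algorithm from the original proposal: the ReviewIndex
class and its methods are hypothetical, the index lives in memory
rather than on disk, keys are assumed to be hashable (plain strings
here), and removal of keys from an input view is not handled.

```python
class ReviewIndex:
    """Hypothetical index of the rows coming from a review's input views."""

    def __init__(self, inputs):
        self.inputs = set(inputs)  # names of the MR views listed in "inputs"
        self.rows = {}             # key -> (source view name, reduced value)

    def update(self, view_name, reduced_rows):
        """Upsert the changed rows of one input view, keyed by row key."""
        if view_name not in self.inputs:
            raise ValueError(f"{view_name!r} is not an input of this review")
        for key, value in reduced_rows:
            source = self.rows.get(key, (view_name, None))[0]
            if source != view_name:
                # Enforces the "no KV duplication" restriction across inputs.
                raise ValueError(f"key {key!r} already emitted by {source!r}")
            self.rows[key] = (view_name, value)

    def build(self, map_fn):
        """Run the review's map over the indexed rows; return rows sorted by key."""
        emitted = []
        for key, (_, value) in self.rows.items():
            emitted.extend(map_fn(key, value))
        return sorted(emitted)


review = ReviewIndex(inputs=["tag_counts"])
review.update("tag_counts", [("couch", 3), ("json", 1)])
review.update("tag_counts", [("json", 2)])  # an input row changes later
sorted_by_value = review.build(lambda key, value: [(value, key)])
```

The build step here recomputes the whole review on every call; keeping
that step incremental is the index management problem mentioned above.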