Date: Thu, 30 Sep 2010 16:09:53 -0700
Subject: Re: CouchDB for data mining
From: Randall Leeds
To: user@couchdb.apache.org

Remarks inline.

On Tue, Sep 28, 2010 at 13:40, Christopher Bare wrote:
> Hi,

Hey there!

>
> I'm looking into CouchDB for a data mining application. I'm a noob, so
> I'm just getting an appreciation for the new (and very creative)
> approach taken with Couch. Please let me first verify that I have a
> few things straight:
>
> A view is a lot more like an index than a query in SQL terms. The keys
> emitted from the mapper are used to construct a b-tree. Aggregate
> values computed in the reducer may be hung on the higher nodes of the
> tree. Constructing this tree is an expensive operation, but read
> access is fast and it can be updated incrementally as the underlying
> data changes. (Baron Schwartz's A Gentle Introduction to CouchDB for
> Relational Practitioners explains this nicely.)

Exactly right. It's clear you've done your reading :).

>
> A view is formulated using the map-reduce (MR) pattern, which
> essentially divides a big job into lots of small independent subtasks.
> In Hadoop and Google's MR, that independence is used for parallelism
> in distributed environment. Couch's use of MR is very different.
> I'm not sure how parallelism comes into play in Couch, but it seems
> to me a key feature of Couch is that the independence of MR is
> exploited to compute and cache partial results in the b-tree and to
> update them incrementally.

Correct. However, nothing about the design of Couch's MR prevents
parallelization, though right now it's only exploited by the 3rd-party
clustering solutions.

>
> The target here is the "shit-loads of users" scenario where the cost
> of building and maintaining the view can be amortized over lots of
> read operations.
>
> Now, if that's all more-or-less right, how does that apply to data
> mining?
>
> In a data mining app, you typically have lots of ad-hoc queries.
> You'll read that Couch doesn't do ad-hoc queries, but I have a feeling
> that, if you're smart about it, you can create views that will serve
> as the basis for whole classes of queries. The view will do part of
> the work and your client code will have to do part as well. I haven't
> quite gotten my head around how this is done, nor around how Couch's
> list functions might fit into the picture.

You're absolutely right about smart view design. Most questions get
resolved with some kind of smart view. Anything that doesn't fit can
generally be solved with a little more work and some add-ons. For
example, full-text indexing (FTI) gets you a bunch of ad-hoc queries
you can't otherwise do, and there are ways to add it to CouchDB today,
though nothing in the official source.

>
> It would be great to have an example data mining app for Couch. The
> classic textbook example is co-occurrence of items in a large database
> of grocery store shopping baskets. You ask questions like, "If a
> customer buys diapers, do they also buy beer?" It will come as little
> surprise to any new parents that, in fact, they do.
> In this case, your documents would consist of a set of purchased
> items and associated information like customer demographics,
> geographic information, sales and promotions, etc. which are usually
> modeled in terms of a star schema in an RDBMS. The task is then to ask
> the same basic questions about what people buy sliced and diced by or
> conditioned on the associated data, like, "Do males in the pacific
> northwest buy diapers and beer when beer is on sale?"

I don't have any links offhand, but I know there have been some blog
posts about some of these topics. If you do the research and come up
with a nice list, please start a wiki page to collect them. I think it'd
be a great resource.

>
> Is something like that an appropriate use case for Couch? It would be
> awesome to have some guidance from the gurus on applications like
> this, which are very different from either transaction processing or
> the highly-available eventual-consistency use-cases often associated
> with NoSQL.

I don't see any reason yet why you *shouldn't* use CouchDB. That said,
you're still pretty early to the big data party, so it'll probably take
some trailblazing.

>
> Thanks!
>
> -- Chris
>

You're totally welcome! Feel free to keep the questions coming or hop on
#couchdb on freenode if you want more real-time feedback from the
community and devs.

-Randall
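
P.S. To make the "smart view" idea a bit more concrete for the
diapers-and-beer question, here is a rough sketch. It is plain Python
that only imitates the map/reduce shape of a view; it is not CouchDB's
API, and the document fields (`items`, `region`) and every function name
are made up for illustration. The idea is that the map step emits one
row per item pair with the region as the last key component, so a single
index can answer both "how often do A and B co-occur?" and "how often in
region R?" by querying on a shorter or longer key prefix.

```python
from collections import Counter
from itertools import combinations

# Hypothetical basket documents; the field names are assumptions.
BASKETS = [
    {"_id": "b1", "region": "pnw", "items": ["diapers", "beer", "milk"]},
    {"_id": "b2", "region": "pnw", "items": ["diapers", "beer"]},
    {"_id": "b3", "region": "south", "items": ["diapers", "wipes"]},
    {"_id": "b4", "region": "south", "items": ["beer", "chips"]},
]


def map_basket(doc):
    """Map step: emit ((item_a, item_b, region), 1) per unordered item pair."""
    for a, b in combinations(sorted(set(doc["items"])), 2):
        yield (a, b, doc["region"]), 1


def add_doc(view, doc):
    """Fold one document into the index; this is the incremental update."""
    for key, value in map_basket(doc):
        view[key] += value


def build_view(docs):
    """Stand-in for the view index: key -> summed value (a sum-style reduce)."""
    view = Counter()
    for doc in docs:
        add_doc(view, doc)
    return view


def query(view, prefix):
    """Sum every row whose key starts with `prefix` (a grouped range read)."""
    n = len(prefix)
    return sum(value for key, value in view.items() if key[:n] == prefix)


view = build_view(BASKETS)
both_anywhere = query(view, ("beer", "diapers"))        # all regions
both_in_pnw = query(view, ("beer", "diapers", "pnw"))   # one region
```

Item pairs are sorted before they are emitted, so query prefixes have to
list the two items in alphabetical order. The number of pairs grows
quadratically with basket size, so for real data the client would
probably want to cap or filter what the map step emits.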