Date: Thu, 30 Sep 2010 16:09:53 -0700
Message-ID: <AANLkTi=fwe+a3kJ6i9uCArH8vGSCw92+FJLQ7E6KDhoP@mail.gmail.com>
Subject: Re: CouchDB for data mining
From: Randall Leeds <randall.leeds@gmail.com>
To: user@couchdb.apache.org

Remarks inline.

On Tue, Sep 28, 2010 at 13:40, Christopher Bare
<cbare@systemsbiology.org> wrote:
> Hi,

Hey there!

>
> I'm looking into CouchDB for a data mining application. I'm a noob, so
> I'm just getting an appreciation for the new (and very creative)
> approach taken with Couch. Please let me first verify that I have a
> few things straight:
>
> A view is a lot more like an index than a query in SQL terms. The keys
> emitted from the mapper are used to construct a b-tree. Aggregate
> values computed in the reducer may be hung on the higher nodes of the
> tree. Constructing this tree is an expensive operation, but read
> access is fast and it can be updated incrementally as the underlying
> data changes. (Baron Schwartz's A Gentle Introduction to CouchDB for
> Relational Practitioners explains this nicely.)

Exactly right. It's clear you've done your reading :).
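
To make the "view is an index" idea concrete, here is a rough sketch in
Python. It is not how CouchDB is implemented (a sorted list stands in for
the b-tree), and the documents, field names, and helper functions are all
made up for illustration.

```python
import bisect

# Hypothetical documents; the field names are invented for this sketch.
docs = [
    {"_id": "a", "item": "beer", "price": 8},
    {"_id": "b", "item": "diapers", "price": 12},
    {"_id": "c", "item": "beer", "price": 9},
    {"_id": "d", "item": "milk", "price": 3},
]

def map_doc(doc):
    """Mapper: emit (key, value) pairs for one document."""
    yield doc["item"], doc["price"]

def build_view(docs):
    """Run the mapper over every doc and keep the rows sorted by key.
    Building this is the expensive part; reading it is cheap."""
    rows = [(key, doc["_id"], value)
            for doc in docs
            for key, value in map_doc(doc)]
    rows.sort()
    return rows

def query(rows, startkey, endkey):
    """Range lookup by binary search over the sorted keys."""
    keys = [row[0] for row in rows]
    lo = bisect.bisect_left(keys, startkey)
    hi = bisect.bisect_right(keys, endkey)
    return rows[lo:hi]

view = build_view(docs)
beer_rows = query(view, "beer", "beer")
beer_total = sum(value for _, _, value in beer_rows)
```

The point is only that the mapper decides the sort key once, up front,
and every later read is a key or key-range lookup against that ordering.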

>
> A view is formulated using the map-reduce (MR) pattern, which
> essentially divides a big job into lots of small independent subtasks.
> In Hadoop and Google's MR, that independence is used for parallelism
> in distributed environment. Couch's use of MR is very different. I'm
> not sure how parallelism comes into play in Couch, but it seems to me
> a key feature of Couch is that the independence of MR is exploited to
> compute and cache partial results in the b-tree and to update them
> incrementally.

Correct. However, there's nothing about the design of couch's MR that
prevents parallelization, though right now it's only exploited by the
3rd-party clustering solutions.
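
As a toy illustration of caching partial results and updating them
incrementally, here is a sketch in Python. It assumes a flat list of
fixed-size chunks in place of real b-tree nodes, and the ChunkedReduce
class and its counters are invented for this example.

```python
def reduce_values(values):
    """Reduce step over raw mapped values (here: a sum)."""
    return sum(values)

def rereduce(partials):
    """Combine already-reduced partial results (also a sum here)."""
    return sum(partials)

class ChunkedReduce:
    """Keeps one cached partial result per chunk of values.
    The chunks are a stand-in for b-tree nodes, not CouchDB's layout."""

    def __init__(self, values, chunk_size=2):
        self.chunks = [list(values[i:i + chunk_size])
                       for i in range(0, len(values), chunk_size)]
        self.partials = [reduce_values(chunk) for chunk in self.chunks]
        self.reduce_calls = len(self.chunks)  # work done so far

    def update(self, chunk_index, position, new_value):
        """Change one value and recompute only the chunk that holds it."""
        self.chunks[chunk_index][position] = new_value
        self.partials[chunk_index] = reduce_values(self.chunks[chunk_index])
        self.reduce_calls += 1

    def total(self):
        """Answer a query from the cached partials alone."""
        return rereduce(self.partials)

cached = ChunkedReduce([1, 2, 3, 4, 5, 6])
total_before = cached.total()
cached.update(1, 0, 10)   # the value 3 becomes 10
total_after = cached.total()
```

Because each chunk is reduced independently, a change touches one cached
partial instead of forcing a full recomputation. The same independence is
what would let the chunks be reduced in parallel.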

>
> The target here is the "shit-loads of users" scenario where the cost
> of building and maintaining the view can be amortized over lots of
> read operations.
>
> Now, if that's all more-or-less right, how does that apply to data mining?
>
> In a data mining app, you typically have lots of ad-hoc queries.
> You'll read that Couch doesn't do ad-hoc queries, but I have a feeling
> that, if you're smart about it, you can create views that will serve
> as the basis for whole classes of queries. The view will do part of
> the work and your client code will have to do part as well. I haven't
> quite gotten my head around how this is done, nor around how Couch's
> list functions might fit into the picture.

You're absolutely right about smart view design. Most questions get
resolved with some kind of smart view.
Anything that doesn't fit can generally be solved with a little
more work and some add-ons. For example, full-text indexing (FTI) gets
you a bunch of ad-hoc queries you can't otherwise do, and there are ways
to add it to CouchDB today, though nothing in the official source.
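
One common flavor of "smart view" is a compound key, so a single view can
answer a whole family of grouped questions. Below is a sketch in Python
that imitates grouping on key prefixes, roughly what the group_level
query parameter does with a counting reduce. The rows and the
grouped_counts helper are hypothetical, and this runs in plain Python,
not against a CouchDB server.

```python
from collections import Counter

# Hypothetical view rows: key = (region, gender, item), value = 1.
rows = [
    (("northwest", "m", "beer"), 1),
    (("northwest", "m", "diapers"), 1),
    (("northwest", "m", "beer"), 1),
    (("northwest", "f", "beer"), 1),
    (("southeast", "m", "beer"), 1),
]

def grouped_counts(rows, group_level):
    """Sum values over keys truncated to their first `group_level`
    elements, imitating a counting reduce at that group level."""
    totals = Counter()
    for key, value in rows:
        totals[key[:group_level]] += value
    return dict(totals)

by_region = grouped_counts(rows, 1)
by_region_gender = grouped_counts(rows, 2)
by_full_key = grouped_counts(rows, 3)
```

The order of the key elements matters: you can only group on leading
prefixes, so a question that slices by item first would need a second
view with the key elements rearranged.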

>
> It would be great to have an example data mining app for Couch. The
> classic textbook example is co-occurrence of items in a large database
> of grocery store shopping baskets. You ask questions like, "If a
> customer buys diapers, do they also buy beer?" It will come as little
> surprise to any new parents that, in fact, they do. In this case,
> your documents would consist of a set of purchased items and
> associated information like customer demographics, geographic
> information, sales and promotions, etc. which are usually modeled in
> terms of a star schema in an RDBMS. The task is then to ask the same
> basic questions about what people buy sliced and diced by or
> conditioned on the associated data, like, "Do males in the pacific
> northwest buy diapers and beer when beer is on sale?"

I don't have any links offhand, but I know there have been some blog
posts about some of these topics.
If you do the research and come up with a nice list please start a
wiki page to collect them. I think it'd be a great resource.
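
As a possible starting point for the basket example, here is a sketch in
Python of a mapper that emits one row per pair of items in a basket, plus
a counting reduce. The basket documents, the "items" field, and the
function names are assumptions for illustration. A real CouchDB view
would be written as JavaScript map and reduce functions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical basket documents.
baskets = [
    {"_id": "b1", "items": ["diapers", "beer", "milk"]},
    {"_id": "b2", "items": ["beer", "diapers"]},
    {"_id": "b3", "items": ["milk", "bread"]},
]

def map_pairs(doc):
    """Emit every unordered pair of distinct items in one basket.
    Sorting first makes (beer, diapers) and (diapers, beer) share a key."""
    for pair in combinations(sorted(set(doc["items"])), 2):
        yield pair, 1

def count_pairs(docs):
    """Counting reduce: how many baskets contain each pair."""
    counts = Counter()
    for doc in docs:
        for key, value in map_pairs(doc):
            counts[key] += value
    return counts

pair_counts = count_pairs(baskets)
diapers_and_beer = pair_counts[("beer", "diapers")]
```

Note that the number of emitted rows grows quadratically with basket
size, so this is only reasonable for small baskets. Conditioning on
demographics or promotions could be done by prepending those fields to
the key.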

>
> Is something like that an appropriate use case for Couch? It would be
> awesome to have some guidance from the gurus on applications like
> this, which are very different from either transaction processing or
> the highly-available eventual-consistency use-cases often associated
> with NoSQL.

I don't see any reason yet why you *shouldn't* use CouchDB. That said,
you're still pretty early to the big data party, so it'll probably
take some trailblazing.

>
> Thanks!
>
> -- Chris
>

You're totally welcome!
Feel free to keep the questions coming or hop on #couchdb on freenode
if you want more real-time feedback from the community and devs.

-Randall