incubator-couchdb-user mailing list archives

From Christopher Bare <cb...@systemsbiology.org>
Subject CouchDB for data mining
Date Tue, 28 Sep 2010 20:40:33 GMT
Hi,

I'm looking into CouchDB for a data mining application. I'm a noob, so
I'm just getting an appreciation for the new (and very creative)
approach taken with Couch. Please let me first verify that I have a
few things straight:

A view is a lot more like an index than a query in SQL terms. The keys
emitted from the mapper are used to construct a b-tree. Aggregate
values computed in the reducer may be hung on the higher nodes of the
tree. Constructing this tree is an expensive operation, but read
access is fast and it can be updated incrementally as the underlying
data changes. (Baron Schwartz's "A Gentle Introduction to CouchDB for
Relational Practitioners" explains this nicely.)
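
To make the "view as index" idea concrete, here is a minimal sketch in Python. It is not CouchDB's API (real views are written in JavaScript and stored in a b-tree on disk); a sorted list stands in for the tree, and the names build_view, query_range, and map_item are hypothetical. It only illustrates that the expensive step happens once, after which key-range reads are cheap lookups.

```python
import bisect

def map_item(doc):
    # Hypothetical mapper: emit one (key, value) row per document.
    yield doc["item"], 1

def build_view(docs, map_fn):
    """Run map_fn over every doc and return rows sorted by key (the expensive step)."""
    rows = []
    for doc in docs:
        for key, value in map_fn(doc):
            rows.append((key, value, doc["_id"]))
    # Sort by key, then by document id, so equal keys have a stable order.
    rows.sort(key=lambda row: (row[0], row[2]))
    return rows

def query_range(rows, startkey, endkey):
    """Return all rows with startkey <= key <= endkey using binary search."""
    keys = [row[0] for row in rows]
    lo = bisect.bisect_left(keys, startkey)
    hi = bisect.bisect_right(keys, endkey)
    return rows[lo:hi]

docs = [
    {"_id": "a", "item": "milk"},
    {"_id": "b", "item": "beer"},
    {"_id": "c", "item": "diapers"},
    {"_id": "d", "item": "beer"},
]
view = build_view(docs, map_item)
beer_rows = query_range(view, "beer", "beer")
```
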

A view is formulated using the map-reduce (MR) pattern, which
essentially divides a big job into lots of small independent subtasks.
In Hadoop and Google's MR, that independence is used for parallelism
in a distributed environment. Couch's use of MR is very different. I'm
not sure how parallelism comes into play in Couch, but it seems to me
a key feature of Couch is that the independence of MR is exploited to
compute and cache partial results in the b-tree and to update them
incrementally.
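
The idea of caching partial results and updating them incrementally can be sketched as follows. This is a toy model in Python, assuming a sum-style reduce and fixed-size chunks in place of b-tree nodes; the class CachedReduce and its fields are hypothetical and say nothing about how CouchDB lays out its tree internally.

```python
class CachedReduce:
    """Toy stand-in for reduce values cached on inner nodes of a tree."""

    def __init__(self, values, chunk_size=4):
        self.chunk_size = chunk_size
        # Each chunk plays the role of one leaf node holding mapped values.
        self.chunks = [
            list(values[i:i + chunk_size])
            for i in range(0, len(values), chunk_size)
        ]
        # One cached partial result per chunk.
        self.partials = [sum(chunk) for chunk in self.chunks]
        # Counts how many chunks were re-reduced after the initial build.
        self.recomputed = 0

    def update(self, index, new_value):
        """Change one value and re-reduce only the chunk that contains it."""
        chunk_no, offset = divmod(index, self.chunk_size)
        self.chunks[chunk_no][offset] = new_value
        self.partials[chunk_no] = sum(self.chunks[chunk_no])
        self.recomputed += 1

    def total(self):
        """Combine the cached partials (a 'reduce of reduces')."""
        return sum(self.partials)

cache = CachedReduce(list(range(1, 13)), chunk_size=4)
before = cache.total()
cache.update(5, 100)  # replaces the value 6 with 100
after = cache.total()
```

Only one of the three chunks is recomputed by the update; the other two partial sums are reused as they are.
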

The target here is the "shit-loads of users" scenario where the cost
of building and maintaining the view can be amortized over lots of
read operations.

Now, if that's all more-or-less right, how does that apply to data mining?

In a data mining app, you typically have lots of ad-hoc queries.
You'll read that Couch doesn't do ad-hoc queries, but I have a feeling
that, if you're smart about it, you can create views that will serve
as the basis for whole classes of queries. The view will do part of
the work and your client code will have to do part as well. I haven't
quite gotten my head around how this is done, nor around how Couch's
list functions might fit into the picture.
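
One way a single view might serve a whole class of queries is a compound key whose prefixes can be grouped at query time, in the spirit of CouchDB's group_level query parameter. The Python sketch below only simulates that behaviour; the document fields (region, gender, items) and the function names are assumptions made for illustration.

```python
from collections import Counter

def map_purchase(doc):
    # Hypothetical mapper: compound key ordered from coarse to fine.
    for item in doc["items"]:
        yield (doc["region"], doc["gender"], item), 1

def build_rows(docs):
    """Collect every emitted (key, value) row."""
    return [row for doc in docs for row in map_purchase(doc)]

def reduce_grouped(rows, group_level):
    """Sum values per key prefix of length group_level."""
    totals = Counter()
    for key, value in rows:
        totals[key[:group_level]] += value
    return dict(totals)

purchases = [
    {"region": "NW", "gender": "M", "items": ["beer", "diapers"]},
    {"region": "NW", "gender": "F", "items": ["diapers"]},
    {"region": "SE", "gender": "M", "items": ["beer"]},
]
rows = build_rows(purchases)
by_region = reduce_grouped(rows, 1)
by_region_gender = reduce_grouped(rows, 2)
by_full_key = reduce_grouped(rows, 3)
```

The same rows answer "how many items per region", "per region and gender", and "per region, gender, and item"; anything that cuts across the key order (say, per item regardless of region) is left to client code or to a second view.
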

It would be great to have an example data mining app for Couch. The
classic textbook example is co-occurrence of items in a large database
of grocery store shopping baskets. You ask questions like, "If a
customer buys diapers, do they also buy beer?" It will come as little
surprise to any new parents that, in fact, they do. In this case,
your documents would consist of a set of purchased items and
associated information like customer demographics, geographic
information, sales and promotions, etc. which are usually modeled in
terms of a star schema in an RDBMS. The task is then to ask the same
basic questions about what people buy sliced and diced by or
conditioned on the associated data, like, "Do males in the Pacific
Northwest buy diapers and beer when beer is on sale?"
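
The basket question could be approached with a mapper that emits single items and item pairs, plus a counting reduce, leaving the final ratio to the client. The following is a minimal Python sketch under those assumptions; the basket documents, map_pairs, count_keys, and confidence are all hypothetical, and a real CouchDB view would be written in JavaScript.

```python
from collections import Counter
from itertools import combinations

def map_pairs(doc):
    """Emit each item alone and each unordered pair of items in the basket."""
    items = sorted(set(doc["items"]))
    for item in items:
        yield (item,), 1
    for pair in combinations(items, 2):
        yield pair, 1

def count_keys(docs):
    """Counting reduce: total emitted value per key."""
    counts = Counter()
    for doc in docs:
        for key, value in map_pairs(doc):
            counts[key] += value
    return counts

def confidence(counts, antecedent, consequent):
    """Client-side step: share of baskets with antecedent that also hold consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return counts[pair] / counts[(antecedent,)]

baskets = [
    {"_id": "1", "items": ["diapers", "beer", "milk"]},
    {"_id": "2", "items": ["diapers", "beer"]},
    {"_id": "3", "items": ["diapers", "bread"]},
    {"_id": "4", "items": ["beer", "chips"]},
]
counts = count_keys(baskets)
diapers_to_beer = confidence(counts, "diapers", "beer")
```

Conditioning on demographics or promotions would mean prepending those fields to the emitted keys, at the price of a larger index.
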

Is something like that an appropriate use case for Couch? It would be
awesome to have some guidance from the gurus on applications like
this, which are very different from either transaction processing or
the highly-available eventual-consistency use-cases often associated
with NoSQL.

Thanks!

--  Chris
