cassandra-user mailing list archives

From "Adaryl "Bob" Wakefield, MBA" <>
Subject Re: Machine Learning With Cassandra
Date Sat, 30 Aug 2014 20:02:30 GMT
Yes, I remember this conversation. That was when I was first stepping into this stuff.
My current understanding is:
Storm = Stream and micro batch
Spark = Batch and micro batch

Micro-batching is what gets you exactly-once processing semantics. I’m clear on that.
What I’m not clear on is how and where processing takes place.

I also get the fact that Spark is a faster execution engine than MapReduce. But we have Tez
now... except, as far as I know, that’s not useful here because my data isn’t in HDFS. People
seem to be talking quite a bit about Mahout and Spark Shell, but I’d really like to get this
done with a minimum amount of software; either Storm or Spark, but not both.

Trident ML isn’t distributed, which is fine because I’m not trying to do learning on the
stream. For now, I’m just trying to do learning in batch and then update parameters as suggested.

Let me simplify the question. How do I do distributed machine learning when my data is in Cassandra
and not HDFS? I haven’t totally explored Mahout yet, but a lot of the algorithms run on MapReduce,
which is fine for now. As I understand it, though, MapReduce works on data in HDFS, correct?

Adaryl "Bob" Wakefield, MBA
Mass Street Analytics
Twitter: @BobLovesData

From: Shahab Yunus 
Sent: Saturday, August 30, 2014 11:23 AM
Subject: Re: Machine Learning With Cassandra

Spark is not storage; rather, it is a streaming framework supposed to be run on a big-data, distributed
architecture (a very high-level intro/definition). It provides a batched version of in-memory
map/reduce-like jobs. It is not completely streaming like Storm; rather, it batches collections
of tuples, and thus you can run complex ML algorithms relatively faster.
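To make the contrast concrete, here is a minimal, framework-free Python sketch of the micro-batching idea: instead of handling each tuple the moment it arrives (Storm-style), the stream is grouped into small fixed-size batches that are processed as units. This is my own illustration of the concept, not Spark’s actual API.

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream of tuples into fixed-size micro-batches."""
    batch = []
    for tup in stream:
        batch.append(tup)
        if len(batch) == batch_size:
            yield batch  # hand a whole batch to downstream processing
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Example: a stream of 10 events processed in batches of 4.
sizes = [len(b) for b in micro_batches(range(10), 4)]
print(sizes)  # → [4, 4, 2]
```

Because each batch is a complete unit, it can be retried or checkpointed as a whole, which is what makes the exactly-once bookkeeping tractable.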

I think we just discussed this a short while ago, when a similar question (Storm vs. Spark, I
think) was raised by you earlier. Here is the link for that discussion:


On Sat, Aug 30, 2014 at 12:16 PM, Adaryl "Bob" Wakefield, MBA <>

  Isn’t it a bit overkill to use both Storm and Spark in the architecture? You say load it “into”
Spark. Is Spark separate storage?


  From: Alex Kamil 
  Sent: Friday, August 29, 2014 10:46 PM
  Subject: Re: Machine Learning With Cassandra


  Most ML algorithms are based on some form of numerical optimization, using something like
online gradient descent or conjugate gradient (e.g. in SVM classifiers). In its simplest form,
it is a nested FOR loop where on each iteration you update the weights or parameters of the
model until reaching some convergence threshold that minimizes the prediction error (usually
the goal is to minimize a loss function, as in the popular least-squares technique). You could
parallelize this loop using a brute-force divide-and-conquer approach: map a chunk of
data to each node, compute a partial sum there, then aggregate the results from each
node into a global sum in a 'reduce' stage, and repeat this map-reduce cycle until convergence.
You can look up distributed gradient descent, or check out Mahout or Spark MLlib for examples.
Alternatively, you can use something like GraphLab.
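The map/reduce cycle described above can be sketched in plain Python for a one-parameter least-squares model y = w * x. This is an illustration with the "nodes" simulated as data shards, not code for any particular framework; a real deployment would run the map step on actual workers.

```python
def partial_gradient(chunk, w):
    """Map step: partial sum of the least-squares gradient over one chunk."""
    return sum(2 * x * (w * x - y) for x, y in chunk)

def distributed_gd(data, n_nodes=4, lr=0.01, tol=1e-9, max_iters=1000):
    chunks = [data[i::n_nodes] for i in range(n_nodes)]  # shard across "nodes"
    w = 0.0
    for _ in range(max_iters):
        # Map on each chunk, then reduce the partial sums into a global gradient.
        grad = sum(partial_gradient(c, w) for c in chunks) / len(data)
        w_next = w - lr * grad
        if abs(w_next - w) < tol:  # convergence threshold reached
            return w_next
        w = w_next
    return w

# Toy data drawn from y = 3x; the learned weight should converge to 3.
data = [(x, 3.0 * x) for x in range(1, 11)]
print(round(distributed_gd(data), 3))  # → 3.0
```

Each map-reduce round here corresponds to one iteration of the outer FOR loop; the expensive part (the partial sums) is what gets distributed.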

  Cassandra can serve as a data store from which you load the training data, e.g. into Spark
using this connector, and then train the model using MLlib or Mahout (it has Spark bindings,
I believe). Once you’ve trained the model, you could save the parameters back in Cassandra. Then
the next stage is using the model to classify new data, e.g. recommend similar items based
on a log of new purchases; there you could once again use Spark, or Storm with something like
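The stages just described (train in batch, persist the parameters, then score new records) can be sketched end to end in plain Python. Everything here is a stand-in: a dict plays the role of a Cassandra parameters table, and the "model" is a trivial mean-threshold classifier rather than anything MLlib or Mahout would produce.

```python
params_store = {}  # stand-in for a Cassandra table holding model parameters

def train(batch):
    """Batch stage: 'learn' a threshold as the mean of the training values."""
    return {"threshold": sum(batch) / len(batch)}

def save_params(model_id, params):
    """Persist the trained parameters (back into 'Cassandra')."""
    params_store[model_id] = params

def classify(model_id, value):
    """Serving stage: score a new record against the stored parameters."""
    params = params_store[model_id]
    return "high" if value > params["threshold"] else "low"

# Train on historical data, persist, then score new events as they arrive.
save_params("purchases-v1", train([10, 20, 30, 40]))  # threshold = 25.0
print(classify("purchases-v1", 37))  # → high
print(classify("purchases-v1", 12))  # → low
```

The point of the shape, not the model: training happens offline over the full batch, while classification only needs the small saved parameter set, so the serving path (Spark or Storm) never touches the training data.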


  On Fri, Aug 29, 2014 at 10:24 PM, Adaryl "Bob" Wakefield, MBA <>

    I’m planning to speak at a local meet-up and I need to know if what I have in my head
is even possible.
    I want to give an example of working with data in Cassandra. I have data coming in through
Kafka and Storm and I’m saving it off to Cassandra (this is only on paper at this point).
I then want to run an ML algorithm over the data. My problem here is, while my data is distributed,
I don’t know how to do the analysis in a distributed manner. I could certainly use R, but
processing the data on a single machine would seem to defeat the purpose of all this scalability.
    What is my solution?
