Isn’t it a bit overkill to use both Storm and Spark in the architecture? You say load it “into”
Spark. Is Spark separate storage?
B.
From: Alex Kamil
Sent: Friday, August 29, 2014 10:46 PM
To: user@cassandra.apache.org
Subject: Re: Machine Learning With Cassandra
Adaryl,
Most ML algorithms are based on some form of numerical optimization, using something like
online gradient descent or conjugate gradient (e.g., in SVM classifiers). In its simplest form
it is a nested FOR loop where on each iteration you update the weights or parameters of the
model until you reach some convergence threshold that minimizes the prediction error (usually
the goal is to minimize a loss function, as in the popular least-squares technique). You could
parallelize this loop using a brute-force divide-and-conquer approach: map a chunk of
data to each node and compute a partial sum there, then aggregate the results from each
node into a global sum in a 'reduce' stage, and repeat this map-reduce cycle until convergence.
You can look up distributed gradient descent or check out Mahout or Spark MLlib for examples.
Alternatively you can use something like GraphLab.
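To make the map-reduce idea concrete, here is a rough sketch of distributed batch gradient
descent over a Spark RDD in Scala (least-squares loss). The Example case class, learning rate,
and convergence test are just illustrative assumptions, not a production recipe:

// Sketch of distributed batch gradient descent on a Spark RDD.
// Each partition computes a partial gradient (the 'map' side); the driver
// sums them (the 'reduce' side), updates the weights, and repeats until
// the update is small enough.
import org.apache.spark.rdd.RDD

object DistributedGD {
  // One training example: feature vector x and target y (least-squares loss).
  case class Example(x: Array[Double], y: Double)

  def train(data: RDD[Example], dims: Int, lr: Double = 0.01,
            maxIter: Int = 100, tol: Double = 1e-6): Array[Double] = {
    var w = Array.fill(dims)(0.0)
    val n = data.count().toDouble
    var iter = 0
    var converged = false

    while (iter < maxIter && !converged) {
      // Broadcast the current weights so every node has a local copy.
      val bw = data.sparkContext.broadcast(w)
      // Map: each example contributes (w.x - y) * x to the gradient;
      // Reduce: element-wise sum of the partial gradients across the cluster.
      val grad = data
        .map { ex =>
          val err = ex.x.zip(bw.value).map { case (xi, wi) => xi * wi }.sum - ex.y
          ex.x.map(_ * err)
        }
        .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })

      val step = grad.map(g => lr * g / n)
      w = w.zip(step).map { case (wi, si) => wi - si }
      converged = math.sqrt(step.map(s => s * s).sum) < tol
      iter += 1
    }
    w
  }
}

MLlib's built-in optimizers do essentially this pattern for you (with mini-batching), so in
practice you would rarely hand-roll the loop.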
Cassandra can serve as the data store from which you load the training data, e.g. into Spark using
this connector, and then train the model using MLlib or Mahout (it has Spark bindings, I believe).
Once you've trained the model, you can save the parameters back in Cassandra. The next
stage is using the model to classify new data, e.g. recommending similar items based on a log
of new purchases; there you could once again use Spark or Storm with something like this.
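As a rough sketch of that pipeline (assuming the DataStax spark-cassandra-connector and Spark 1.x
MLlib; the keyspace, table, and column names below are made up for illustration):

// Load training rows from Cassandra, train a model with MLlib,
// and write the learned parameters back to Cassandra.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object TrainFromCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("train-from-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Pull the training table into an RDD of LabeledPoint (hypothetical schema).
    val training = sc.cassandraTable("ml", "purchases")
      .map(row => LabeledPoint(
        row.getDouble("label"),
        Vectors.dense(row.getDouble("f1"), row.getDouble("f2"), row.getDouble("f3"))))
      .cache()

    // Train a simple model with MLlib's distributed SGD optimizer.
    val model = LinearRegressionWithSGD.train(training, numIterations = 100)

    // Persist the learned parameters back to Cassandra for later scoring.
    sc.parallelize(Seq(("linear-v1", model.weights.toArray.toList, model.intercept)))
      .saveToCassandra("ml", "models", SomeColumns("model_id", "weights", "intercept"))

    sc.stop()
  }
}

A Storm topology (or another Spark job) could then read the saved parameters and score new
events as they arrive.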
Alex
On Fri, Aug 29, 2014 at 10:24 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com>
wrote:
I’m planning to speak at a local meetup and I need to know if what I have in my head
is even possible.
I want to give an example of working with data in Cassandra. I have data coming in through
Kafka and Storm and I’m saving it off to Cassandra (this is only on paper at this point).
I then want to run an ML algorithm over the data. My problem here is that, while my data is distributed,
I don’t know how to do the analysis in a distributed manner. I could certainly use R, but
processing the data on a single machine would seem to defeat the purpose of all this scalability.
What is my solution?
B.
