systemml-dev mailing list archives

From Gustavo Frederico <gustavo.freder...@thinkwrap.com>
Subject Comparing scikit-learn, Mahout Samsara and SystemML
Date Tue, 06 Jun 2017 01:56:09 GMT
Greetings, 

I worked with the theory of SVMs during my graduate studies and I’m relatively new to existing
ML software. Assuming that I want to create new scalable ML algorithms starting from the math,
the question is: how do scikit-learn, Mahout Samsara and SystemML compare to each other?

I see interesting Python-based frameworks such as scikit-learn, but then I read SystemML's
article on Wikipedia, which made me question the distributed scalability of ("pure") Python
for large amounts of data:

"[...] It was observed that data scientists would write machine learning algorithms in languages
such as R and Python for small data. When it came time to scale to big data, a systems programmer
would be needed to scale the algorithm in a language such as Scala. This process typically
involved days or weeks per iteration, and errors would occur translating the algorithms to
operate on big data." (https://en.wikipedia.org/wiki/Apache_SystemML)

And the article starts by stating that Apache SystemML has "algorithm customizability via [...]
Python-like languages".

Mahout Samsara is based on Scala. PredictionIO (predictionio.incubator.apache.org) algorithms
are based on Mahout Samsara and Scala. I asked Mr. Matthias Boehm at a conference how one
could compare Mahout Samsara to SystemML. From what I understood, Samsara needs "explicit
declarations" in expressions for distributed computing, while SystemML doesn’t (please
correct me if I’m wrong). Also, SystemML will optimize the entire script, while Samsara will
optimize individual expressions (again, please correct me if I’m wrong).

While my main criterion is scalability (cluster, GPU support, etc.), other criteria to evaluate
these frameworks may be: a) public adoption, b) active dev community, c) quality of tools
for development, d) backing of big companies, e) simplicity of working with clusters (delegating
the complexities of clustering to the framework, "hiding" them from the user), f) quality
of documentation, g) quality of the software itself.

(My question was deleted from stats.stackexchange.com for being off-topic and deleted from
Stack Overflow for being bound to get answers with "opinions rather than facts" [sic]. I’m
very much interested in hearing balanced and insightful comments from the list.)

Thank you,

Gustavo