mahout-dev mailing list archives

From Andrew Musselman <andrew.mussel...@gmail.com>
Subject Re: 0xdata interested in contributing
Date Thu, 13 Mar 2014 01:16:06 GMT
Sounds like a large positive step; looking forward to hearing more!

> On Mar 12, 2014, at 5:44 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> 
> I have been working with a company named 0xdata to help them contribute
> some new software to Mahout.  This software will give Mahout the ability to
> do highly iterative in-memory mathematical computations on a cluster or a
> single machine. This software also comes with high performance distributed
> implementations of k-means, logistic regression, random forest and other
> algorithms.
> 
> I will be starting a thread about this on the dev list shortly, but I
> wanted the PMC members to have a short heads up on what has been happening
> now that we have consensus on the 0xdata side of the game.
> 
> I think that this has major potential to bring a large contributing
> community to Mahout.  Technically, it will, at a stroke, make
> Mahout the highest performing machine learning framework around.
> 
> *Development Roadmap*
> 
> Of the requirements that people have been talking about on the main mailing
> list, the following capabilities will be provided by this contribution:
> 
> 1) high performance distributed linear algebra
> 
> 2) basic machine learning codes including logistic regression, other
> generalized linear modeling codes, random forest, clustering
> 
> 3) standard file format parsing system (CSV, Lucene, parquet, other) x
>    (continuous, constant, categorical, word-like, text-like)
> 
> 4) standard web-based basic applications for common operations
> 
> 5) language bindings (Java, Scala, R, other)
> 
> 6) interactive + batch use
> 
> 7) common representation/good abstraction over representation
> 
> 8) platform diversity, localhost, with/without ( Hadoop, Yarn, Mesos, EC2,
> GCE )
> 
> 
> *Backstory*
> 
> I was recently approached by Sri Satish, CEO and co-founder of 0xdata, who
> wanted to explore whether they could donate some portion of the h2o
> framework and technology to Mahout.  I was skeptical since all that I had
> previously seen was the application level demos for this system and I was
> not at all familiar with the technology underneath. One of the co-founders
> of 0xdata, however, is Cliff Click, who was one of the co-authors of the
> server HotSpot compiler.  That alone made the offer worth examining.
> 
> Over the last few weeks, the technical team of 0xdata has been working with
> me to work out whether this contribution would be useful to Mahout.
> 
> My strong conclusion is that the donation, with some associated shim work
> that 0xdata is committing to doing, will satisfy roughly 80% of the goals
> that have emerged over the last week or so of discussion.  Just as
> important, this donation connects Mahout to new communities who are very
> actively working at the frontiers of machine learning, which is likely to
> inject lots of new blood and excitement into the Mahout community.  This
> has huge potential outside of Mahout itself as well, since having a very
> strong technical infrastructure that we can all use across many projects
> has the potential to have the same sort of impact on machine learning
> applications and products that Hadoop has had for file-based parallel
> processing.  Coming together on a common platform has the potential to
> create markets that would not exist without this commonality.
> 
> 
> *Technical Underpinnings*
> 
> At the lowest level, the h2o framework provides a way to have named objects
> stored in memory across a cluster in directly computable form.  H2o also
> provides a very fine-grained parallel execution framework that allows
> computation to be moved close to the data while maintaining computational
> efficiency with tasks as small as milliseconds in scale.  Objects live on
> multiple machines and live until they are explicitly deallocated or until
> the framework is terminated.
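
A minimal single-process sketch of the idea described above: named objects held in memory as chunks, with small tasks run against each chunk and the partial results combined. This is not h2o's API. The `ChunkStore` class, its method names, and the "prices" data are hypothetical, and threads on one machine stand in for nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ChunkStore:
    """Toy stand-in for a store of named, chunked in-memory objects (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, name, values, chunk_size):
        # Split the data into fixed-size chunks; on a real cluster each chunk
        # would live on some node, here they all live in one process.
        self._objects[name] = [
            values[i:i + chunk_size] for i in range(0, len(values), chunk_size)
        ]

    def delete(self, name):
        # Objects stay in memory until they are explicitly deallocated.
        del self._objects[name]

    def map_reduce(self, name, map_fn, reduce_fn, workers=4):
        # Run one small task per chunk, then fold the per-chunk results together.
        chunks = self._objects[name]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(map_fn, chunks))
        return reduce(reduce_fn, partials)


store = ChunkStore()
store.put("prices", list(range(1, 101)), chunk_size=10)

# The mean is computed from two per-chunk reductions: a sum and a count.
total = store.map_reduce("prices", sum, lambda a, b: a + b)
count = store.map_reduce("prices", len, lambda a, b: a + b)
mean = total / count
```

Only the small per-chunk results (a sum and a count) are combined, which is what lets the work run next to the data instead of moving the data to the work.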
> 
> Additional machines can join the framework, but data isn't automatically
> balanced, nor is it assumed that failures are handled within the framework.
> As might be expected given the background of the authors, some pretty
> astounding things are done using JVM magic so coding at this lowest level
> is remarkably congenial.
> 
> This framework can be deployed as a map-only Hadoop program, or as a bunch
> of independent programs which borg together as they come up.  Importantly,
> it is trivial to start a single node framework as well for easy development
> and testing.
> 
> On top of this lowest level, there are math libraries which implement low
> level operations as well as a variety of machine learning algorithms.  These
> include high quality implementations of a variety of machine learning
> programs including generalized linear modeling with binomial logistic
> regression and good regularization, linear regression, neural networks,
> random forests and so on.  There are also parsing codes which will load
> formatted data in parallel from persistency layers such as HDFS or
> conventional files.
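
For readers unfamiliar with the term, here is a small illustration of what "binomial logistic regression with regularization" computes. It is a generic textbook sketch (batch gradient descent with an L2 penalty on the weights), not the h2o implementation, whose solver and penalty the message does not describe. The function names, default parameters, and toy data are all assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, l2=0.1, learning_rate=0.5, iterations=200):
    """Fit weights and bias by batch gradient descent on the L2-penalized log loss."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row: predicted probability minus label.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # The l2 * w term shrinks the weights; the bias is left unpenalized.
        weights = [
            w - learning_rate * (g / n + l2 * w) for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Predicted probability that row x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy one-feature data set: negative inputs are class 0, positive are class 1.
rows = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]

weights, bias = fit_logistic(rows, labels)
strong_weights, _ = fit_logistic(rows, labels, l2=1.0)
plain_weights, _ = fit_logistic(rows, labels, l2=0.0)
```

The gradient sums in the inner loop are the part that a chunked, in-memory framework would compute per chunk and then add together on each iteration.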
> 
> At the level of these learning programs, there are web interfaces which
> allow data elements in the framework to be created, managed and deleted.
> 
> There is also an R binding for h2o which allows programs to access and
> manage h2o objects.  Functions defined in an R-like language can be applied
> in parallel to data frames stored in the h2o framework.
> 
> *Proposed Developer User Experience*
> 
> I see several kinds of users.  These include numerical developers (largely
> mathematicians), Java or Scala developers (like current Mahout devs), and
> data analysts.
> 
> - Local h2o single-node cluster
> - Temporary h2o cluster
> - Shared h2o cluster
> 
> All of these modes will be facilitated by the proposed development.
> 
> *Complementarity with Other Platforms*
> 
> I view h2o as complementary with Hadoop and Spark because it provides a
> solid in-memory execution engine, as opposed to the general out-of-core
> computation model implemented by other map-reduce engines like Hadoop and
> Spark or by more general dataflow systems like Stratosphere, Tez or Drill.
> 
> Also, h2o provides no persistence but depends on other systems for that
> such as NFS, HDFS, NAS or MapR.
> 
> H2o is also nicely complementary to R in that R can invoke operations and
> move data to and from h2o very easily.
> 
> *Required Additional Work*
> 
> Sparse matrices
> Linear algebra bindings
> Class-file magic to allow off-the-cuff function definitions
