Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mahout.apache.org
Received-SPF: pass (nike.apache.org: domain of andrew.musselman@gmail.com
 designates 209.85.160.41 as permitted sender)
From: Andrew Musselman <andrew.musselman@gmail.com>
Content-Type: text/plain;
	charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Mime-Version: 1.0 (1.0)
Subject: Re: 0xdata interested in contributing
Message-Id: <14B10982-84F0-47E7-BB9E-E7EF3EB8B41E@gmail.com>
Date: Wed, 12 Mar 2014 18:16:06 -0700
References: 
 <CAJwFCa3694MK6e5hiUh9sho=ZZh_B2Hi1rqTnjKHeXaSRoOa1A@mail.gmail.com>
In-Reply-To: 
 <CAJwFCa3694MK6e5hiUh9sho=ZZh_B2Hi1rqTnjKHeXaSRoOa1A@mail.gmail.com>
To: "dev@mahout.apache.org" <dev@mahout.apache.org>

Sounds like a large positive step; looking forward to hearing more!

> On Mar 12, 2014, at 5:44 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> I have been working with a company named 0xdata to help them contribute
> some new software to Mahout.  This software will give Mahout the ability to
> do highly iterative in-memory mathematical computations on a cluster or a
> single machine. This software also comes with high performance distributed
> implementations of k-means, logistic regression, random forest and other
> algorithms.
>
> I will be starting a thread about this on the dev list shortly, but I
> wanted the PMC members to have a short heads up on what has been happening
> now that we have consensus on the 0xdata side of the game.
>
> I think that this has a major potential to bring in an enormous amount of
> contributing community to Mahout.  Technically, it will, at a stroke, make
> Mahout the highest performing machine learning framework around.
>
> *Development Roadmap*
>
> Of the requirements that people have been talking about on the main mailing
> list, the following capabilities will be provided by this contribution:
>
> 1) high performance distributed linear algebra
>
> 2) basic machine learning codes including logistic regression, other
> generalized linear modeling codes, random forest, clustering
>
> 3) standard file format parsing system (CSV, Lucene, parquet, other) x
>    (continuous, constant, categorical, word-like, text-like)
>
> 4) standard web-based basic applications for common operations
>
> 5) language bindings (Java, Scala, R, other)
>
> 6) interactive + batch use
>
> 7) common representation/good abstraction over representation
>
> 8) platform diversity, localhost, with/without (Hadoop, Yarn, Mesos, EC2,
> GCE)
>
>
> *Backstory*
>
> I was recently approached by Sri Satish, CEO and co-founder of 0xdata,
> who wanted to explore whether they could donate some portion of the h2o
> framework and technology to Mahout.  I was skeptical since all that I had
> previously seen was the application level demos for this system and I was
> not at all familiar with the technology underneath. One of the co-founders
> of 0xdata, however, is Cliff Click who was one of the co-authors of the
> server HotSpot compiler.  That alone made the offer worth examining.
>
> Over the last few weeks, the technical team of 0xdata has been working with
> me to work out whether this contribution would be useful to Mahout.
>
> My strong conclusion is that the donation, with some associated shim work
> that 0xdata is committing to doing, will satisfy roughly 80% of the goals
> that have emerged over the last week or so of discussion.  Just as
> important, this donation connects Mahout to new communities who are very
> actively working at the frontiers of machine learning, which is likely to
> inject lots of new blood and excitement into the Mahout community.  This
> has huge potential outside of Mahout itself as well since having a very
> strong technical infrastructure that we can all use across many projects
> has the potential to have the same sort of impact on machine learning
> applications and products that Hadoop has had for file-based parallel
> processing.  Coming together on a common platform has the potential to
> create markets that would not exist without this commonality.
>
>
> *Technical Underpinnings*
>
> At the lowest level, the h2o framework provides a way to have named objects
> stored in memory across a cluster in directly computable form.  H2o also
> provides a very fine-grained parallel execution framework that allows
> computation to be moved close to the data while maintaining computational
> efficiency with tasks as small as milliseconds in scale.  Objects live on
> multiple machines and live until they are explicitly deallocated or until
> the framework is terminated.
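
To make the model in the paragraph above concrete for myself, here is a minimal single-process sketch in Python. It is not the h2o API: ToyStore, put, remove and map_reduce are hypothetical names invented for illustration, and threads stand in for cluster nodes. It only shows the shape of the idea: named objects held in memory as chunks, small tasks run against each chunk, and objects that live until explicitly removed.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ToyStore:
    """Hypothetical in-memory store; each named object is a list of chunks."""

    def __init__(self, workers=4):
        self._objects = {}
        self._workers = workers

    def put(self, name, values, chunk_size=1000):
        # Split the values into fixed-size chunks, standing in for the
        # pieces of an object that would live on different machines.
        self._objects[name] = [
            values[i:i + chunk_size] for i in range(0, len(values), chunk_size)
        ]

    def remove(self, name):
        # Objects live until they are explicitly deallocated.
        del self._objects[name]

    def __contains__(self, name):
        return name in self._objects

    def map_reduce(self, name, map_fn, reduce_fn):
        # Run one small task per chunk, then combine the partial results.
        # Assumes the named object has at least one chunk.
        with ThreadPoolExecutor(max_workers=self._workers) as pool:
            partials = list(pool.map(map_fn, self._objects[name]))
        return reduce(reduce_fn, partials)


if __name__ == "__main__":
    store = ToyStore()
    store.put("numbers", list(range(10000)))
    print(store.map_reduce("numbers", sum, lambda a, b: a + b))
    store.remove("numbers")
```

The per-chunk function is the only part that touches data, which is what lets the computation move to where the chunk lives in a real cluster.
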
>
> Additional machines can join the framework, but data isn't automatically
> balanced, nor is it assumed that failures are handled within the framework.
> As might be expected given the background of the authors, some pretty
> astounding things are done using JVM magic so coding at this lowest level
> is remarkably congenial.
>
> This framework can be deployed as a map-only Hadoop program, or as a bunch
> of independent programs which borg together as they come up.  Importantly,
> it is trivial to start a single node framework as well for easy development
> and testing.
>
> On top of this lowest level, there are math libraries which implement low
> level operations as well as a variety of machine learning algorithms.  These
> include high quality implementations of a variety of machine learning
> programs including generalized linear modeling with binomial logistic
> regression and good regularization, linear regression, neural networks,
> random forests and so on.
> There are also parsing codes which will load formatted data in parallel from
> persistency layers such as HDFS or conventional files.
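
As a reference point for "binomial logistic regression and good regularization", here is a small Python sketch of L2-regularized logistic regression fitted by plain batch gradient descent. This is an assumption-laden illustration, not the h2o implementation: the function names, the choice of gradient descent, and the default l2, rate and iterations values are all mine, and a production GLM solver would likely use a different optimizer.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, x):
    # Probability that the label is 1 for feature vector x.
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, l2=0.01, rate=0.5, iterations=500):
    """Batch gradient descent on the mean log loss plus (l2 / 2) * |w|^2.

    The bias term is left unregularized.
    """
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # The L2 penalty adds l2 * w to each weight's gradient.
        weights = [w - rate * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / n
    return weights, bias


if __name__ == "__main__":
    rows = [[0.0], [1.0], [2.0], [3.0]]
    labels = [0, 0, 1, 1]
    weights, bias = fit_logistic(rows, labels)
    print(predict(weights, bias, [0.0]), predict(weights, bias, [3.0]))
```

The inner loop over rows is a sum of per-row gradients, so it splits naturally into per-chunk partial sums, which is the kind of iterative in-memory computation the proposal is aimed at.
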
>
> At the level of these learning programs, there are web interfaces which
> allow data elements in the framework to be created, managed and deleted.
>
> There is also an R binding for h2o which allows programs to access and
> manage h2o objects.  Functions defined in an R-like language can be applied
> in parallel to data frames stored in the h2o framework.
>
> *Proposed Developer User Experience*
>
> I see several kinds of users.  These include numerical developers (largely
> mathematicians), Java or Scala developers (like current Mahout devs), and
> data analysts.
>
> - Local h2o single-node cluster
> - Temporary h2o cluster
> - Shared h2o cluster
>
> All of these modes will be facilitated by the proposed development.
>
> *Complementarity with Other Platforms*
>
> I view h2o as complementary with Hadoop and Spark because it provides a
> solid in-memory execution engine, as opposed to the general out-of-core
> computation model that map-reduce engines like Hadoop and Spark implement,
> or the more general dataflow systems like Stratosphere, Tez or Drill.
>
> Also, h2o provides no persistence of its own but depends on other systems
> for that, such as NFS, HDFS, NAS or MapR.
>
> H2o is also nicely complementary to R in that R can invoke operations and
> move data to and from h2o very easily.
>
> *Required Additional Work*
>
> - Sparse matrices
> - Linear algebra bindings
> - Class-file magic to allow off-the-cuff function definitions