From: Andrew Musselman
To: "dev@mahout.apache.org"
Subject: Re: 0xdata interested in contributing
Date: Wed, 12 Mar 2014 18:16:06 -0700
Message-Id: <14B10982-84F0-47E7-BB9E-E7EF3EB8B41E@gmail.com>

Sounds like a large positive step; looking forward to hearing more!

> On Mar 12, 2014, at 5:44 PM, Ted Dunning wrote:
>
> I have been working with a company named 0xdata to help them contribute
> some new software to Mahout. This software will give Mahout the ability to
> do highly iterative in-memory mathematical computations on a cluster or a
> single machine. This software also comes with high performance distributed
> implementations of k-means, logistic regression, random forest and other
> algorithms.
>
> I will be starting a thread about this on the dev list shortly, but I
> wanted the PMC members to have a short heads up on what has been happening
> now that we have consensus on the 0xdata side of the game.
>
> I think that this has a major potential to bring in an enormous amount of
> contributing community to Mahout. Technically, it will, at a stroke, make
> Mahout the highest performing machine learning framework around.
>
> *Development Roadmap*
>
> Of the requirements that people have been talking about on the main mailing
> list, the following capabilities will be provided by this contribution:
>
> 1) high performance distributed linear algebra
>
> 2) basic machine learning codes including logistic regression, other
> generalized linear modeling codes, random forest, clustering
>
> 3) standard file format parsing system (CSV, Lucene, parquet, other) x
> (continuous, constant, categorical, word-like, text-like)
>
> 4) standard web-based basic applications for common operations
>
> 5) language bindings (Java, Scala, R, other)
>
> 6) interactive + batch use
>
> 7) common representation/good abstraction over representation
>
> 8) platform diversity, localhost, with/without (Hadoop, Yarn, Mesos, EC2,
> GCE)
>
>
> *Backstory*
>
> I was recently approached by Sri Satish, CEO and co-founder of 0xdata, who
> wanted to explore whether they could donate some portion of the h2o
> framework and technology to Mahout. I was skeptical since all that I had
> previously seen was the application-level demos for this system and I was
> not at all familiar with the technology underneath. One of the co-founders
> of 0xdata, however, is Cliff Click, who was one of the co-authors of the
> server HotSpot compiler. That alone made the offer worth examining.
>
> Over the last few weeks, the technical team of 0xdata has been working with
> me to work out whether this contribution would be useful to Mahout.
>
> My strong conclusion is that the donation, with some associated shim work
> that 0xdata is committing to doing, will satisfy roughly 80% of the goals
> that have emerged over the last week or so of discussion. Just as
> important, this donation connects Mahout to new communities who are very
> actively working at the frontiers of machine learning, which is likely to
> inject lots of new blood and excitement into the Mahout community.
> This has huge potential outside of Mahout itself as well, since having a
> very strong technical infrastructure that we can all use across many
> projects has the potential to have the same sort of impact on machine
> learning applications and products that Hadoop has had for file-based
> parallel processing. Coming together on a common platform has the potential
> to create markets that would not exist without this commonality.
>
>
> *Technical Underpinnings*
>
> At the lowest level, the h2o framework provides a way to have named objects
> stored in memory across a cluster in directly computable form. H2o also
> provides a very fine-grained parallel execution framework that allows
> computation to be moved close to the data while maintaining computational
> efficiency with tasks as small as milliseconds in scale. Objects live on
> multiple machines and live until they are explicitly deallocated or until
> the framework is terminated.
>
> Additional machines can join the framework, but data isn't automatically
> balanced, nor is it assumed that failures are handled within the framework.
> As might be expected given the background of the authors, some pretty
> astounding things are done using JVM magic so coding at this lowest level
> is remarkably congenial.
>
> This framework can be deployed as a map-only Hadoop program, or as a bunch
> of independent programs which borg together as they come up. Importantly,
> it is trivial to start a single node framework as well for easy development
> and testing.
>
> On top of this lowest level, there are math libraries which implement low
> level operations as well as a variety of machine learning algorithms. These
> include high quality implementations of a variety of machine learning
> programs including generalized linear modeling with binomial logistic
> regression and good regularization, linear regression, neural networks,
> random forests and so on.
> There are also parsing codes which will load formatted data in parallel
> from persistency layers such as HDFS or conventional files.
>
> At the level of these learning programs, there are web interfaces which
> allow data elements in the framework to be created, managed and deleted.
>
> There is also an R binding for h2o which allows programs to access and
> manage h2o objects. Functions defined in an R-like language can be applied
> in parallel to data frames stored in the h2o framework.
>
> *Proposed Developer User Experience*
>
> I see several kinds of users. These include numerical developers (largely
> mathematicians), Java or Scala developers (like current Mahout devs), and
> data analysts.
>
> - Local h2o single-node cluster
> - Temporary h2o cluster
> - Shared h2o cluster
>
> All of these modes will be facilitated by the proposed development.
>
> *Complementarity with Other Platforms*
>
> I view h2o as complementary with Hadoop and Spark because it provides a
> solid in-memory execution engine, as opposed to the general out-of-core
> computation model that map-reduce engines like Hadoop and Spark implement
> or more general dataflow systems like Stratosphere, Tez or Drill.
>
> Also, h2o provides no persistence but depends on other systems for that,
> such as NFS, HDFS, NAS or MapR.
>
> H2o is also nicely complementary to R in that R can invoke operations and
> move data to and from h2o very easily.
>
> *Required Additional Work*
>
> Sparse matrices
> Linear algebra bindings
> Class-file magic to allow off-the-cuff function definitions