mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: 0xdata interested in contributing
Date Thu, 13 Mar 2014 20:11:24 GMT
I don't think that it is Spark vs h2o.  They do different things.  Spark
(and Drill and Tez and Impala and Stratosphere) do what map-reduce wanted
to do.

H2o does math.

And I don't think we are betting our future.  I think we are letting some
contributors show us their chops and hopefully make Mahout much better.

And, no, it isn't 10x faster.  For many important Mahout applications, like
classifiers, it is more than 100x faster.


On Thu, Mar 13, 2014 at 1:01 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

> Has anyone used 0xdata before? They are new to me. If this is betting
> Mahout’s future on h2o vs Spark, is everyone convinced that’s the right
> choice? Does Mahout warrant or need its own next gen fast parallel
> platform? Does this mean supporting something akin to Spark as part of
> Mahout or is h2o going to become another project? Is it 10x faster for our
> uses? Is it easier to move the code to h2o than Spark?
>
> I’m all for moving somewhere soon, but taking the wrong step is just as
> deadly as taking none.
>
> On Mar 13, 2014, at 11:35 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
> +1.  Happy to help get it migrated!
>
>
> On Mar 12, 2014, at 8:44 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > I have been working with a company named 0xdata to help them contribute
> > some new software to Mahout.  This software will give Mahout the ability
> > to do highly iterative in-memory mathematical computations on a cluster
> > or a single machine.  This software also comes with high-performance
> > distributed implementations of k-means, logistic regression, random
> > forest, and other algorithms.
> >
> > I will be starting a thread about this on the dev list shortly, but I
> > wanted the PMC members to have a short heads-up on what has been
> > happening, now that we have consensus on the 0xdata side of the game.
> >
> > I think that this has major potential to bring an enormous amount of
> > new contributing community to Mahout.  Technically, it will, at a
> > stroke, make Mahout the highest-performing machine learning framework
> > around.
> >
> > *Development Roadmap*
> >
> > Of the requirements that people have been talking about on the main
> > mailing list, the following capabilities will be provided by this
> > contribution:
> >
> > 1) high performance distributed linear algebra
> >
> > 2) basic machine learning codes, including logistic regression, other
> > generalized linear modeling codes, random forest, and clustering
> >
> > 3) a standard file-format parsing system (CSV, Lucene, Parquet, other),
> > crossed with column types (continuous, constant, categorical, word-like,
> > text-like)
> >
> > 4) standard web-based basic applications for common operations
> >
> > 5) language bindings (Java, Scala, R, other)
> >
> > 6) interactive + batch use
> >
> > 7) common representation/good abstraction over representation
> >
> > 8) platform diversity: localhost, with/without Hadoop, YARN, Mesos,
> > EC2, or GCE
> >
> >
> > *Backstory*
> >
> > I was recently approached by Sri Satish, CEO and co-founder of 0xdata,
> > who wanted to explore whether they could donate some portion of the h2o
> > framework and technology to Mahout.  I was skeptical, since all that I
> > had previously seen was the application-level demos for this system, and
> > I was not at all familiar with the technology underneath.  One of the
> > co-founders of 0xdata, however, is Cliff Click, who was one of the
> > co-authors of the HotSpot server compiler.  That alone made the offer
> > worth examining.
> >
> > Over the last few weeks, the technical team of 0xdata has been working
> > with me to work out whether this contribution would be useful to Mahout.
> >
> > My strong conclusion is that the donation, with some associated shim
> > work that 0xdata is committing to doing, will satisfy roughly 80% of the
> > goals that have emerged over the last week or so of discussion.  Just as
> > important, this donation connects Mahout to new communities who are very
> > actively working at the frontiers of machine learning, which is likely
> > to inject lots of new blood and excitement into the Mahout community.
> > This has huge potential outside of Mahout itself as well: a very strong
> > technical infrastructure that we can all use across many projects has
> > the potential to have the same sort of impact on machine learning
> > applications and products that Hadoop has had for file-based parallel
> > processing.  Coming together on a common platform can create markets
> > that would otherwise not exist without this commonality.
> >
> >
> > *Technical Underpinnings*
> >
> > At the lowest level, the h2o framework provides a way to have named
> > objects stored in memory across a cluster in directly computable form.
> > H2o also provides a very fine-grained parallel execution framework that
> > allows computation to be moved close to the data while maintaining
> > computational efficiency with tasks as small as milliseconds in scale.
> > Objects live on multiple machines and persist until they are explicitly
> > deallocated or until the framework is terminated.
> >
> > Additional machines can join the framework, but data isn't automatically
> > balanced, nor is it assumed that failures are handled within the
> > framework.  As might be expected given the background of the authors,
> > some pretty astounding things are done using JVM magic, so coding at
> > this lowest level is remarkably congenial.
> >
> > This framework can be deployed as a map-only Hadoop program, or as a
> > bunch of independent programs which borg together as they come up.
> > Importantly, it is also trivial to start a single-node framework for
> > easy development and testing.
> >
> > On top of this lowest level, there are math libraries which implement
> > low-level operations as well as a variety of machine learning
> > algorithms.  These include high-quality implementations of a variety of
> > machine learning programs, including generalized linear modeling with
> > binomial logistic regression and good regularization, linear regression,
> > neural networks, random forests, and so on.  There are also parsing
> > codes which will load formatted data in parallel from persistence layers
> > such as HDFS or conventional files.
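As a rough illustration of the kind of generalized linear modeling code
described above, here is a minimal single-machine logistic regression with L2
regularization, trained by stochastic gradient descent.  This is a sketch for
orientation only, not h2o's implementation; all names and parameters in it are
invented, and a distributed version would run the gradient computation
chunk-by-chunk across the cluster.

```java
public class LogisticSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Stochastic gradient descent on the regularized logistic loss.
    // x rows are feature vectors (include a constant 1 for the bias),
    // y holds 0/1 labels, lr is the learning rate, l2 the ridge penalty.
    static double[] fit(double[][] x, int[] y, double lr, double l2, int epochs) {
        double[] w = new double[x[0].length];
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < x.length; i++) {
                double z = 0;
                for (int j = 0; j < w.length; j++) z += w[j] * x[i][j];
                double err = sigmoid(z) - y[i];            // gradient of log-loss
                for (int j = 0; j < w.length; j++)
                    w[j] -= lr * (err * x[i][j] + l2 * w[j]); // L2 shrinkage
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Toy separable data: label is 1 iff the first feature is positive.
        // Second column is the constant bias term.
        double[][] x = {{-2, 1}, {-1, 1}, {1, 1}, {2, 1}};
        int[] y = {0, 0, 1, 1};
        double[] w = fit(x, y, 0.5, 1e-4, 200);
        System.out.println(sigmoid(2 * w[0] + w[1]) > 0.5); // prints true
    }
}
```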
> >
> > At the level of these learning programs, there are web interfaces which
> > allow data elements in the framework to be created, managed, and
> > deleted.
> >
> > There is also an R binding for h2o which allows programs to access and
> > manage h2o objects.  Functions defined in an R-like language can be
> > applied in parallel to data frames stored in the h2o framework.
> >
> > *Proposed Developer User Experience*
> >
> > I see several kinds of users.  These include numerical developers
> > (largely mathematicians), Java or Scala developers (like current Mahout
> > devs), and data analysts.  All of them would work against one of three
> > deployment modes:
> >
> > - Local h2o single-node cluster
> > - Temporary h2o cluster
> > - Shared h2o cluster
> >
> > All of these modes will be facilitated by the proposed development.
> >
> > *Complementarity with Other Platforms*
> >
> > I view h2o as complementary with Hadoop and Spark because it provides a
> > solid in-memory execution engine, as opposed to the more general
> > out-of-core computation model implemented by map-reduce engines like
> > Hadoop and Spark, or by more general dataflow systems like Stratosphere,
> > Tez, or Drill.
> >
> > Also, h2o provides no persistence of its own, but depends on other
> > systems for that, such as NFS, HDFS, NAS, or MapR.
> >
> > H2o is also nicely complementary to R, in that R can invoke operations
> > and move data to and from h2o very easily.
> >
> > *Required Additional Work*
> >
> > Sparse matrices
> > Linear algebra bindings
> > Class-file magic to allow off-the-cuff function definitions
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>
