mahout-dev mailing list archives

From: Tommaso Teofili <tommaso.teof...@gmail.com>
Subject: Re: Apache Giraph?
Date: Mon, 05 Sep 2011 07:44:12 GMT
2011/9/5 Ted Dunning <ted.dunning@gmail.com>

> I think that alternatives to traditional map-reduce are really important
> for ML.  If you think about it, it is now common for 30-40 machine
> clusters to have a terabyte of RAM, and it is actually slightly unusual
> to have ML datasets that large.  As such, BSP-like compute models have a
> lot to recommend them.
>
> Also, there are implementations of BSP or Pregel that allow the
> computation to exceed memory by requiring all nodes to be serializable.
> Only the highly active ones are kept in memory.  This allows a graceful
> transition from complete memory-residency to something more like
> traditional Hadoop-style map-reduce.
>
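To make that "all nodes serializable" idea concrete, here is a rough
sketch of the mechanism: a bounded, access-ordered cache keeps the hot
vertices in memory and pages evicted ones out to disk.  The VertexStore
class below is hypothetical (it is not Hama's or Giraph's actual API),
assuming only that vertices implement java.io.Serializable:

// Hypothetical sketch of spillable vertex storage: an access-ordered
// LinkedHashMap gives us LRU eviction; evicted vertices are serialized
// to disk and transparently paged back in on the next access.
import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;

class VertexStore<V extends Serializable> {
    private final File spillDir;
    private final LinkedHashMap<Long, V> hot;

    VertexStore(File spillDir, final int maxInMemory) {
        this.spillDir = spillDir;
        // true = access order, so iteration order tracks recency of use
        this.hot = new LinkedHashMap<Long, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, V> eldest) {
                if (size() > maxInMemory) {
                    spill(eldest.getKey(), eldest.getValue());
                    return true;  // drop from memory once safely on disk
                }
                return false;
            }
        };
    }

    void put(long id, V vertex) { hot.put(id, vertex); }

    @SuppressWarnings("unchecked")
    V get(long id) throws IOException, ClassNotFoundException {
        V v = hot.get(id);
        if (v == null) {  // cold vertex: page it back in (assumes it was spilled)
            ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(fileFor(id)));
            try { v = (V) in.readObject(); } finally { in.close(); }
            hot.put(id, v);  // may in turn spill another cold vertex
        }
        return v;
    }

    private void spill(long id, V vertex) {
        try {
            ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(fileFor(id)));
            try { out.writeObject(vertex); } finally { out.close(); }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private File fileFor(long id) { return new File(spillDir, id + ".bin"); }
}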

On that note, you may want to have a look at Apache Hama, which is based
on the BSP model but also offers a Pregel-like API:
http://incubator.apache.org/hama/
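
For a flavour of the "think like a vertex" style such a Pregel-like API
gives you, here is a minimal, self-contained sketch of PageRank written
that way.  The Vertex/Context plumbing is hypothetical, modelled on (but
not identical to) what Giraph and Hama actually expose:

// Minimal sketch of a Pregel-style vertex program for PageRank.  The
// Vertex/Context types are hypothetical stand-ins, NOT the real Giraph
// or Hama API; they only show the superstep/compute/message shape.
import java.util.List;

interface Context {
    int getSuperstep();                         // current BSP superstep
    long getNumVertices();                      // vertices in the whole graph
    void sendMessage(long target, double msg);  // delivered next superstep
}

abstract class Vertex {
    long id;
    double value;       // this vertex's current PageRank
    long[] outEdges;    // ids of the vertices we link to
    private boolean halted = false;

    abstract void compute(List<Double> messages, Context ctx);

    void voteToHalt() { halted = true; }
}

class PageRankVertex extends Vertex {
    private static final int MAX_SUPERSTEPS = 30;

    @Override
    void compute(List<Double> messages, Context ctx) {
        if (ctx.getSuperstep() > 0) {
            // Gather: sum the rank shares in-neighbours sent last superstep.
            double sum = 0;
            for (double m : messages) {
                sum += m;
            }
            value = 0.15 / ctx.getNumVertices() + 0.85 * sum;
        }
        if (ctx.getSuperstep() < MAX_SUPERSTEPS && outEdges.length > 0) {
            // Scatter: send each out-neighbour an equal share of our rank.
            double share = value / outEdges.length;
            for (long target : outEdges) {
                ctx.sendMessage(target, share);
            }
        } else {
            voteToHalt();  // run ends once every vertex has voted to halt
        }
    }
}

The runtime calls compute() on every active vertex once per superstep and
delivers messages at the superstep barrier, which is exactly the BSP
structure described above (and the pattern behind the fast PageRank run
Jake mentions below).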

My 2 cents,
Tommaso


>
> Another interesting model is that of Spark.  I can well imagine that
> much of what we do could be replaced with very small Spark programs
> which could be composed much more easily than our current map-reduce
> command-line stuff could be glued together.  Lots of code should
> experience a two-orders-of-magnitude speedup from the use of these
> alternative systems.
>
> The arrival of map-reduce 2.0 (or the equivalent use of Mesos) should
> actually liberate us from the tyranny of a single compute model, but it
> does bring us to the point of having to decide how much of this sort of
> thing we want to depend on from another project and how much we want to
> take on ourselves.  For instance, it might well be possible to co-opt
> the Spark community as part of our own ... their purpose was to support
> machine learning, and joining forces might help achieve that on a
> broader scale than previously envisioned.  GraphLab is an interesting
> case as well.
>
> On Sun, Sep 4, 2011 at 11:42 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
>
> > Hey gang,
> >
> > Has anyone here played much with Giraph
> > <http://incubator.apache.org/giraph/> (currently in the Apache
> > Incubator)?  One of my co-workers ran it on our corporate Hadoop
> > cluster this past weekend, and found it did a very fast PageRank
> > computation (far faster than even well-tuned M/R code on the same
> > data), and it worked pretty close to out of the box.  Seems like that
> > style of computation (in-memory distributed datasets), as used by
> > Giraph (and the recently-discussed-on-this-list GraphLab
> > <http://graphlab.org/>, and Spark <http://www.spark-project.org/>, and
> > Twister <http://www.iterativemapreduce.org/>, and Vowpal Wabbit
> > <http://hunch.net/~vw/>, and probably a few others) is more and more
> > the way to go for a lot of the things we want to do: scalable machine
> > learning.  "RAM is the new Disk, and Disk is the new Tape," after all...
> >
> > Giraph in particular seems nice, in that it runs on top of
> > "old-fashioned" Hadoop: it takes up (long-lived) Mapper slots on your
> > regular cluster, spins up a ZK cluster if you don't supply the
> > location of one, and is all in Java.  That may be a minus for some
> > people, I guess, but the alternatives mean running some big exec'ed-out
> > C++ code (GraphLab, VW), running on top of the (admittedly awesome)
> > Mesos (Spark, which, while running on the JVM, is also in Scala), or
> > running totally custom inter-server communication and data structures
> > (Twister and many of the others).
> >
> > Seems we should be not just supportive of this kind of thing, but
> > try to find some common ground and integration points.
> >
> >  -jake
> >
>
