mahout-dev mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: Straw poll re: H2O ?
Date Thu, 01 May 2014 16:01:43 GMT
Odd that the KMeans implementation isn't being used as a way to demonstrate performance. Seems like anyone
could grab it, run the same data on MLlib, and perform a principled analysis.
Or just run the same data through H2O and MLlib. This seems like a good way to look at the
forest instead of the trees.
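
A principled comparison could be as simple as timing the same operation on each engine over identical input. A minimal sketch of such a timing harness in Scala — the `Bench` object and the MLlib call in the comment are illustrative assumptions, not an existing Mahout utility:

```scala
// Minimal timing harness for comparing the same workload across engines.
object Bench {
  // Runs a block once and returns (result, elapsed milliseconds).
  def timeIt[A](body: => A): (A, Long) = {
    val t0 = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - t0) / 1000000L
    (result, elapsedMs)
  }
}

// Hypothetical usage against MLlib's KMeans (not run here; `points` is
// an RDD[Vector] you would load yourself):
//   val (model, ms) = Bench.timeIt {
//     org.apache.spark.mllib.clustering.KMeans.train(points, 20, 10)
//   }
//   println(s"MLlib KMeans: $ms ms")
```

The same block, wrapped around the H2O equivalent over the same data, would give the apples-to-apples number the thread is asking for.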

BTW, any generalization effort to support two execution engines will have to abstract away
the SparkContext. This is where IO, job control, and engine tuning happen. Abstracting the
DSL is not sufficient. Any hypothetical MahoutContext (a good idea, for sure) will have broad
impact if it deviates significantly from a SparkContext.

http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext
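
To make the scope concrete, here is a hedged sketch of the surface such an abstraction would have to cover. Every name here (MahoutContext, LocalContext, the method names) is hypothetical — none of this exists in Mahout; it illustrates the problem, not an API:

```scala
// Hypothetical sketch: the surface a MahoutContext would need to abstract
// away if it is to hide a SparkContext (or an H2O equivalent) entirely.
trait MahoutContext {
  // IO: engine-specific readers/writers (cf. SparkContext.textFile).
  def readDrm(path: String): AnyRef
  // Job control: broadcast, shutdown.
  def broadcast[T](value: T): T
  def stop(): Unit
  // Engine tuning: keys like spark.executor.memory have no
  // engine-neutral meaning, so tuning leaks through as key/value pairs.
  def setOption(key: String, value: String): Unit
}

// Trivial in-memory stand-in, just to show the trait is implementable.
class LocalContext extends MahoutContext {
  private val opts = scala.collection.mutable.Map[String, String]()
  def readDrm(path: String): AnyRef = path // stub: no real IO
  def broadcast[T](value: T): T = value
  def stop(): Unit = ()
  def setOption(key: String, value: String): Unit = opts(key) = value
  def getOption(key: String): Option[String] = opts.get(key)
}
```

Note how the tuning methods are where the abstraction strains: the option keys themselves remain engine-specific, which is exactly the "broad impact" concern above.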


On May 1, 2014, at 8:40 AM, Cliff Click <cclick0@gmail.com> wrote:

H2O will launch an internal Task in the single-digit microsecond range. Because of this,
we can launch 100,000's (millions?) per second, leading to fine-grained data parallelism
and high CPU utilization. This is a big piece of our single-node speed. Some other distributed
task-launching solutions I've seen tend to require a network hop per task, leading to your
10 ms-to-launch-a-task requirement, which limits you to a few thousand tasks/sec and forces
tasks that are much larger and coarser than H2O's, leading to much lower CPU utilization.

Also, I'm getting 200-microsecond pings between my datacenter machines, down from 10 msec.
It's decent commodity hardware, nothing special. Meaning: H2O can launch a task on an entire
32-node cluster in about 1 msec, starting from a single driving node (log-tree fanout, depth
5, 200-microsecond single-UDP-packet launch, 1-microsecond internal task launch).
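
The ~1 msec figure follows directly from the tree launch: a binary fanout tree of depth 5 covers 2^5 = 32 nodes, and each level costs one 200 µs UDP hop. A quick sketch of that arithmetic, using only the constants stated in the message (the `LaunchModel` object itself is illustrative):

```scala
// Back-of-envelope model of log-tree task launch, using the numbers from
// the message: 200 µs per single-UDP-packet hop, ~1 µs internal launch.
object LaunchModel {
  val hopMicros = 200L            // datacenter ping, single UDP packet
  val internalLaunchMicros = 1L   // H2O internal task launch

  // Depth of a fanout tree covering `nodes` machines.
  def treeDepth(nodes: Int, fanout: Int = 2): Int = {
    var depth = 0
    var covered = 1
    while (covered < nodes) { covered *= fanout; depth += 1 }
    depth
  }

  // Total time (µs) to reach every node from a single driving node.
  def clusterLaunchMicros(nodes: Int): Long =
    treeDepth(nodes) * hopMicros + internalLaunchMicros
}
```

For 32 nodes this gives depth 5 and 5 × 200 + 1 = 1001 µs, i.e. about 1 msec, matching the figure above.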

And this latency matters when the work itself is lots and lots of "small" jobs, as is common
when a DSL such as Mahout or Spark/Scala or R is driving simple operators over bulk data.

Cliff


On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
> This is kind of old news. They all do, for years now. I've been building a system
> that does real-time distributed pipelines (~30 ms to start all steps in a pipeline + in-core
> complexity) for years. Note that node-to-node hops in clouds usually average about 10 ms,
> so microseconds are out of the question for network-performance reasons in real life, except
> on private racks. The only thing that doesn't do this is the MR variety of Hadoop.


