Subject: Re: Straw poll re: H2O ?
From: Pat Ferrel
Date: Thu, 1 May 2014 09:01:43 -0700
To: dev@mahout.apache.org

Odd that the Kmeans implementation isn't a way to demonstrate performance. Seems like anyone could grab that, try it with the same data on MLlib, and perform a principled analysis. Or just run the same data through H2O and MLlib. This seems like a good way to look at the forest instead of the trees.

BTW any generalization effort to support two execution engines will have to abstract away the SparkContext. This is where IO, job control, and engine tuning happen. Abstracting the DSL is not sufficient. Any hypothetical MahoutContext (a good idea for sure), if it deviates significantly from a SparkContext, will have broad impact.

http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext

On May 1, 2014, at 8:40 AM, Cliff Click wrote:

H2O will launch an internal Task in the single-digit microsecond range. Because of this, we can launch 100,000's (millions?) a second... leading to fine-grained data parallelism, and high CPU utilization. This is a big piece of our single-node speed.
Some other distributed Task-launching solutions I've seen tend to require a network hop per task... leading to your 10ms-to-launch-a-task requirement, leading to a limit of a few thousand Tasks/sec, requiring tasks that are much larger and coarser than H2O's... leading to much lower CPU utilization.

Also, I'm getting 200-microsecond pings between my datacenter machines... down from 10 msec. It's decent commodity hardware, nothing special. Meaning: H2O can launch a task on an entire 32-node cluster in about 1 msec, starting from a single driving node (log-tree fanout, depth 5, 200-microsecond single UDP packet launch, 1-microsecond internal task launch).

And this latency matters when the work itself is lots and lots of "small" jobs, as is common when a DSL such as Mahout or Spark/Scala or R is driving simple operators over bulk data.

Cliff

On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
> This is kind of old news. They all do, for years now. I've been building a system that does real-time distributed pipelines (~30 ms to start all steps in pipeline + in-core complexity) for years. Note that node-to-node hops in clouds usually average about 10 ms, so microseconds are kind of out of the question for network performance reasons in real life, except for private racks. The only thing that doesn't do this is the MR variety of Hadoop.
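
Cliff's fan-out arithmetic above (32 nodes, depth 5, 200 microseconds per UDP hop, 1 microsecond per internal task launch) can be checked with a short sketch. The function names are hypothetical, and the model is an assumption: a doubling tree in which each level costs one hop plus one local launch. It is not a description of H2O's actual implementation.

```python
def tree_depth(nodes: int, fanout: int = 2) -> int:
    """Levels needed for a launch tree rooted at one driver to cover `nodes` machines.

    Assumption: the set of reached machines grows by a factor of `fanout` per level.
    """
    depth = 0
    reached = 1  # the driving node already has the task
    while reached < nodes:
        reached *= fanout
        depth += 1
    return depth


def cluster_launch_us(nodes: int, hop_us: float = 200.0,
                      task_us: float = 1.0, fanout: int = 2) -> float:
    """Estimated time in microseconds to start a task on every node.

    Assumption: each tree level costs one network hop plus one local task launch,
    and all sends within a level happen in parallel.
    """
    return tree_depth(nodes, fanout) * (hop_us + task_us)


if __name__ == "__main__":
    # Figures quoted in the message: 32 nodes, 200 us hop, 1 us task launch.
    print(tree_depth(32))         # 5
    print(cluster_launch_us(32))  # 1005.0 microseconds, i.e. about 1 ms
```

Under the same model, replacing the 200-microsecond hop with the 10 ms cloud hop Dmitriy mentions gives roughly 50 ms for the same 32-node launch, which is the gap the two messages are arguing about.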