Subject: Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
From: Jeremy Freeman
Date: Thu, 14 Aug 2014 02:38:18 -0400
Cc: Nicholas Chammas, Reynold Xin, dev@spark.apache.org
Message-Id: <0388F2F0-F7F1-45E3-9930-FCFDD0FFE825@gmail.com>
To: Ignacio Zendejas

@Ignacio, happy to share. Here's a link to a library we've been
developing (https://github.com/freeman-lab/thunder). As just a couple of
examples, we have pipelines that use Fourier transforms and other signal
processing from SciPy, and others that do massively parallel model
fitting via scikit-learn functions, etc. That should give you some idea
of how such libraries could be usefully integrated into a PySpark
project. Btw, a couple of things we do overlap with functionality now
available in MLlib via the Python API, which we're working on
integrating.

On Aug 13, 2014, at 5:16 PM, Ignacio Zendejas wrote:

> Yep, I thought it was a bogus comparison.
>
> I should rephrase my question as it was poorly phrased: on average, how
> much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
> I've only used Spark and don't have a chance to test this at the moment,
> so if anybody has these numbers or general estimates (10x, etc.), that'd
> be great.
>
> @Jeremy, if you can discuss this, what's an example of a project you
> implemented using these libraries + PySpark?
>
> Thanks everyone!
>
>
> On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas
> <nicholas.chammas@gmail.com> wrote:
>
>> On a related note, I recently heard about Distributed R, which is
>> coming out of HP/Vertica and seems to be their proposition for machine
>> learning at scale.
>>
>> It would be interesting to see some kind of comparison between that and
>> MLlib (and perhaps also SparkR?), especially since Distributed R has a
>> concept of distributed arrays and works on data in-memory. Docs are
>> here.
>>
>> Nick
>>
>>
>> On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin wrote:
>>
>>> They only compared their own implementations of a couple of algorithms
>>> on different platforms rather than comparing the different platforms
>>> themselves (in the case of Spark -- PySpark). I can write two variants
>>> of an algorithm on Spark and make them perform drastically differently.
>>>
>>> I have no doubt that if you implement an ML algorithm in Python itself
>>> without any native libraries, the performance will be sub-optimal.
>>>
>>> What PySpark really provides is:
>>>
>>> - Using Spark transformations in Python
>>> - ML algorithms implemented in Scala (leveraging native numerical
>>>   libraries for high performance), and callable in Python
>>>
>>> The paper claims "Python is now one of the most popular languages for
>>> ML-oriented programming", and that's why they went ahead with Python.
>>> However, as I understand, very few people actually implement algorithms
>>> in Python directly because of the sub-optimal performance. Most people
>>> implement algorithms in other languages (e.g. C / Java), and expose
>>> APIs in Python for ease-of-use. This is what we are trying to do with
>>> PySpark as well.
>>>
>>>
>>> On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas
>>> <ignacio.zendejas.cs@gmail.com> wrote:
>>>
>>>> Has anyone had a chance to look at this paper (with title in subject)?
>>>> http://www.cs.rice.edu/~lp6/comparison.pdf
>>>>
>>>> Interesting that they chose to use Python alone. Do we know how much
>>>> faster Scala is vs. Python in general, if at all?
>>>>
>>>> As with any and all benchmarks, I'm sure there are caveats, but it'd be
>>>> nice to have a response to the question above for starters.
>>>>
>>>> Thanks,
>>>> Ignacio