spark-dev mailing list archives

From Jeremy Freeman <freeman.jer...@gmail.com>
Subject Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
Date Thu, 14 Aug 2014 06:38:18 GMT
@Ignacio, happy to share. Here's a link to a library we've been developing (https://github.com/freeman-lab/thunder).
As just a couple of examples, we have pipelines that use Fourier transforms and other signal
processing from SciPy, and others that do massively parallel model fitting via scikit-learn
functions, etc. That should give you some idea of how such libraries could be usefully integrated
into a PySpark project. Btw, a couple of things we do overlap with functionality now available
in MLlib via the Python API, which we're working on integrating.
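
A minimal sketch of the general pattern described above, not code taken from thunder: a per-record function built on a numerical library is mapped over a distributed collection. The function names, the (key, series) record layout, and the use of numpy.fft in place of the SciPy routines are all assumptions made for illustration. `sc` is assumed to be an existing SparkContext.

```python
import numpy as np


def dominant_frequency(record):
    """Return (key, index of the strongest frequency bin) for one record.

    The (key, series) record layout is a hypothetical choice for this sketch.
    """
    key, series = record
    series = np.asarray(series, dtype=float)
    # Remove the mean so the zero-frequency (DC) bin does not dominate.
    spectrum = np.abs(np.fft.rfft(series - series.mean()))
    return key, int(np.argmax(spectrum))


def run_local(records):
    """Apply the per-record function without Spark, e.g. for testing."""
    return [dominant_frequency(r) for r in records]


def run_on_spark(sc, records):
    """Apply the same function in parallel; `sc` is an existing SparkContext."""
    return sc.parallelize(records).map(dominant_frequency).collect()
```

The numerical work happens inside the native FFT routine, so the Python code only orchestrates which record goes where. The same function can be checked locally with run_local before being handed to Spark.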

On Aug 13, 2014, at 5:16 PM, Ignacio Zendejas <ignacio.zendejas.cs@gmail.com> wrote:

> Yep, I thought it was a bogus comparison.
> 
> I should rephrase my question as it was poorly phrased: on average, how
> much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
> I've only used Spark and don't have a chance to test this at the moment so
> if anybody has these numbers or general estimates (10x, etc), that'd be
> great.
> 
> @Jeremy, if you can discuss this, what's an example of a project you
> implemented using these libraries + PySpark?
> 
> Thanks everyone!
> 
> 
> 
> 
> On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
> 
>> On a related note, I recently heard about Distributed R
>> <https://github.com/vertica/DistributedR>, which is coming out of
>> HP/Vertica and seems to be their proposition for machine learning at scale.
>> 
>> It would be interesting to see some kind of comparison between that and
>> MLlib (and perhaps also SparkR
>> <https://github.com/amplab-extras/SparkR-pkg>?), especially since
>> Distributed R has a concept of distributed arrays and works on data
>> in-memory. Docs are here.
>> <https://github.com/vertica/DistributedR/tree/master/doc/platform>
>> 
>> Nick
>> 
>> 
>> On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin <rxin@databricks.com> wrote:
>> 
>>> They only compared their own implementations of a couple of algorithms on
>>> different platforms rather than comparing the different platforms
>>> themselves (in the case of Spark -- PySpark). I can write two variants of
>>> an algorithm on Spark and make them perform drastically differently.
>>> 
>>> I have no doubt that if you implement an ML algorithm in Python itself
>>> without any native libraries, the performance will be sub-optimal.
>>> 
>>> What PySpark really provides is:
>>> 
>>> - Using Spark transformations in Python
>>> - ML algorithms implemented in Scala (leveraging native numerical libraries
>>>   for high performance), and callable in Python
>>> 
>>> The paper claims "Python is now one of the most popular languages for
>>> ML-oriented programming", and that's why they went ahead with Python.
>>> However, as I understand, very few people actually implement algorithms in
>>> Python directly because of the sub-optimal performance. Most people
>>> implement algorithms in other languages (e.g. C / Java), and expose APIs in
>>> Python for ease-of-use. This is what we are trying to do with PySpark as
>>> well.
>>> 
>>> 
>>> On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas <
>>> ignacio.zendejas.cs@gmail.com> wrote:
>>> 
>>>> Has anyone had a chance to look at this paper (with title in subject)?
>>>> http://www.cs.rice.edu/~lp6/comparison.pdf
>>>> 
>>>> Interesting that they chose to use Python alone. Do we know how much
>>>> faster Scala is vs. Python in general, if at all?
>>>> 
>>>> As with any and all benchmarks, I'm sure there are caveats, but it'd be
>>>> nice to have a response to the question above for starters.
>>>> 
>>>> Thanks,
>>>> Ignacio
>>>> 
>>> 
>> 
>> 

