spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
Date Wed, 13 Aug 2014 21:20:24 GMT
BTW you can find the original Presto (rebranded as Distributed R) paper
here:
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf


On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin <rxin@databricks.com> wrote:

> Actually I believe the same person started both projects.
>
> The Distributed R project from HP was started by Shivaram Venkataraman
> when he was there. He has since moved to Berkeley AMPLab to pursue a PhD,
> and SparkR is his latest project.
>
>
>
> On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> On a related note, I recently heard about Distributed R
>> <https://github.com/vertica/DistributedR>, which is coming out of
>> HP/Vertica and seems to be their proposition for machine learning at scale.
>>
>> It would be interesting to see some kind of comparison between that and
>> MLlib (and perhaps also SparkR
>> <https://github.com/amplab-extras/SparkR-pkg>?), especially since
>> Distributed R has a concept of distributed arrays and works on data
>> in-memory. Docs are here.
>> <https://github.com/vertica/DistributedR/tree/master/doc/platform>
>>
>> Nick
>>
>>
>> On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> They only compared their own implementations of a couple of algorithms on
>>> different platforms rather than comparing the different platforms
>>> themselves (in the case of Spark -- PySpark). I can write two variants of
>>> an algorithm on Spark and make them perform drastically differently.
>>>
>>> I have no doubt that if you implement an ML algorithm in Python itself
>>> without any native libraries, the performance will be sub-optimal.
>>>
>>> What PySpark really provides is:
>>>
>>> - Using Spark transformations in Python
>>> - ML algorithms implemented in Scala (leveraging native numerical
>>>   libraries for high performance), and callable in Python
>>>
>>> The paper claims "Python is now one of the most popular languages for
>>> ML-oriented programming", and that's why they went ahead with Python.
>>> However, as I understand it, very few people actually implement
>>> algorithms in Python directly because of the sub-optimal performance.
>>> Most people implement algorithms in other languages (e.g. C / Java) and
>>> expose APIs in Python for ease of use. This is what we are trying to do
>>> with PySpark as well.
>>>
>>>
>>> On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas <
>>> ignacio.zendejas.cs@gmail.com> wrote:
>>>
>>> > Has anyone had a chance to look at this paper (with title in subject)?
>>> > http://www.cs.rice.edu/~lp6/comparison.pdf
>>> >
>>> > Interesting that they chose to use Python alone. Do we know how much
>>> > faster Scala is vs. Python in general, if at all?
>>> >
>>> > As with any and all benchmarks, I'm sure there are caveats, but it'd be
>>> > nice to have a response to the question above for starters.
>>> >
>>> > Thanks,
>>> > Ignacio
>>> >
>>>
>>
>>
>
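
The point made in the thread, that an algorithm written as a plain Python loop tends to run slower than the same computation delegated to natively implemented code, can be illustrated with a small sketch. This is not code from the paper or from PySpark; the function names `dot_pure_python` and `dot_builtin` are hypothetical, and the example assumes only CPython's standard library, where `sum` and `map` are implemented in C. Timings depend on the machine and interpreter, so none are quoted here.

```python
import operator
import timeit


def dot_pure_python(xs, ys):
    """Dot product with an explicit Python-level loop (hypothetical helper)."""
    total = 0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total


def dot_builtin(xs, ys):
    """Same dot product, with the iteration driven by C-implemented built-ins."""
    return sum(map(operator.mul, xs, ys))


if __name__ == "__main__":
    # Integer inputs keep both results exactly equal.
    xs = list(range(100_000))
    ys = list(range(100_000))
    assert dot_pure_python(xs, ys) == dot_builtin(xs, ys)

    # Time each variant; absolute numbers vary by machine and Python version.
    for fn in (dot_pure_python, dot_builtin):
        seconds = timeit.timeit(lambda: fn(xs, ys), number=20)
        print(f"{fn.__name__}: {seconds:.3f}s for 20 runs")
```

Both functions compute the same result; only where the loop executes differs. The same idea, at a much larger scale, is what the thread describes for PySpark: the Python API stays thin while the heavy numerical work runs in Scala and native libraries.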
