spark-issues mailing list archives

From "Hari Sekhon (JIRA)" <>
Subject [jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark
Date Fri, 27 Feb 2015 15:38:04 GMT


Hari Sekhon commented on SPARK-5654:

Sean - ever worked for a bank?

What you've said is tantamount to saying Cloudera has zero value because people can download
Apache Hadoop for free from the Apache website, carefully select compatible component versions
(remember the Pig vs Hadoop version mismatches, anyone?), hand-write all the XML, build
all the automation and packaging themselves, and then self-support it based on documentation
and code diving (those days before CDH were good for learning and bad for productivity, btw).

Commercial support and professionally pre-packaged integration are very important to financials
and other large traditional enterprises (e.g. Experian, another former employer) - exactly
the environments where the vendors need to make their bread. The compile-it-yourself, self-supporting
web-scale companies like the one I worked for before Cloudera rarely pay vendors!

Btw, I did build SparkR a few times - quite frankly I'm sick of dealing with it for every cluster
and every release, and with the differing versions of components that need to line up to avoid
serial ID mismatch exceptions etc.

Nobody wants to give this to quants as a production tool without any support. Given the nature
of these large environments, the buck has to stop with somebody, and nobody wants to put their
own head on the chopping block for supplying unsupported technology - that's one of the reasons
vendors like Databricks, Cloudera, Hortonworks etc. exist.

I know Alteryx are also eager for it - it's another tool we use, and another problem area of scale
this would solve for all their customers (technically they could rewrite in one of the other
API languages, but given they already have modules in R, SparkR would make more sense to port
to). Other data scientists I used to work with were also talking about wanting this early last
year... we thought it would have happened by now. I even asked people a few months ago, such as
one of the SparkR guys and vendors who I was told had spoken to Databricks about it, but I've
just realized I should have also raised a jira like this directly here myself, as I usually do.

Now that Revolution R has been bought by Microsoft, the timing for Databricks to add this is good.

> Integrate SparkR into Apache Spark
> ----------------------------------
>                 Key: SPARK-5654
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Project Infra
>            Reporter: Shivaram Venkataraman
> The SparkR project [1] provides a lightweight frontend to launch Spark jobs from R.
> The project was started at the AMPLab around a year ago and has been incubated as its own
> project to make sure it can be easily merged into upstream Spark, i.e. not introduce any
> external dependencies etc. SparkR's goals are similar to PySpark's, and it shares a similar
> design pattern, as described in our meetup talk [2] and Spark Summit presentation [3].
> Integrating SparkR into the Apache project will enable R users to use Spark out of the
> box, and given R's large user base, it will help the Spark project reach more users. Additionally,
> work-in-progress features like providing R integration with ML Pipelines and DataFrames can
> be better achieved by development in a unified code base.
> SparkR is available under the Apache 2.0 License and does not have any external dependencies
> other than requiring users to have R and Java installed on their machines. SparkR's developers
> come from many organizations including UC Berkeley, Alteryx, and Intel, and we will support
> future development and maintenance after the integration.
> [1]
> [2]
> [3]

This message was sent by Atlassian JIRA
