spark-reviews mailing list archives

From shivaram <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-6806] [SPARKR] [DOCS] Add a new SparkR ...
Date Fri, 29 May 2015 18:00:35 GMT
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6490#discussion_r31348609
  
    --- Diff: docs/sparkr.md ---
    @@ -0,0 +1,198 @@
    +---
    +layout: global
    +displayTitle: SparkR (R on Spark)
    +title: SparkR (R on Spark)
    +---
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +# Overview
    +SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
    +In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that
    +supports operations similar to R data frames and [dplyr](https://github.com/hadley/dplyr), but on large
    +datasets.
    +
    +# SparkR DataFrames
    +
    +A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    +equivalent to a table in a relational database or a data frame in R, but with richer
    +optimizations under the hood. DataFrames can be constructed from a wide array of sources such as:
    +structured data files, tables in Hive, external databases, or existing local R data frames.
    +
    +All of the examples on this page use sample data included in R or the Spark distribution
    +and can be run using the `./bin/sparkR` shell.
    +
    +## Starting Up: SparkContext, SQLContext
    +
    +<div data-lang="r"  markdown="1">
    +The entry point into SparkR is the `SparkContext`, which connects your R program to a Spark cluster.
    +You can create a `SparkContext` using `sparkR.init` and pass in options such as the application name.
    +Further, to work with DataFrames we will need a `SQLContext`, which can be created from the
    +SparkContext. If you are working from the SparkR shell, the `SQLContext` and `SparkContext` should
    +already be created for you.
    +
    +{% highlight r %}
    +sc <- sparkR.init()
    +sqlContext <- sparkRSQL.init(sc)
    --- End diff --
    
    Yeah, I was thinking about the fact that having these two init calls is wasteful. But longer
    term, when we (say) want to introduce ML functionality that requires the SparkContext, it might
    be good to familiarize users with the idea of having a SparkContext around?
    
    We can definitely do an implicit sparkR.init, though, if we find that no SparkContext exists
    (something like the logic we use in https://github.com/apache/spark/blob/e7b61775571ce7a06d044bc3a6055ff94c7477d6/R/pkg/R/sparkR.R#L105).
    
    We can also check if the SparkContext 
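    As a rough sketch of what that implicit initialization could look like, assuming it lives inside the SparkR package where the internal `.sparkREnv` environment from sparkR.R is visible (the `sparkRSQL.initImplicit` name and the fallback logic are illustrative, not the actual implementation):
    
    ```r
    # Illustrative sketch: fall back to an implicit sparkR.init() when the
    # caller has not created a SparkContext yet. The existence check mirrors
    # the idea at sparkR.R#L105 referenced above.
    sparkRSQL.initImplicit <- function(jsc = NULL) {
      if (is.null(jsc)) {
        if (exists(".sparkRjsc", envir = .sparkREnv)) {
          # Reuse the context the user (or the SparkR shell) already created.
          jsc <- get(".sparkRjsc", envir = .sparkREnv)
        } else {
          # No context around: create one implicitly with default options.
          jsc <- sparkR.init()
        }
      }
      sparkRSQL.init(jsc)
    }
    ```
    
    With something like this, `sqlContext <- sparkRSQL.initImplicit()` would work both in the shell (reusing the existing context) and in a standalone script (creating one on the fly).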


