spark-reviews mailing list archives

From junyangq <...@git.apache.org>
Subject [GitHub] spark pull request #14980: [SPARK-17317][SparkR] Add SparkR vignette
Date Wed, 07 Sep 2016 08:18:53 GMT
Github user junyangq commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14980#discussion_r77778144
  
    --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
    @@ -0,0 +1,853 @@
    +---
    +title: "SparkR - Practical Guide"
    +output:
    +  html_document:
    +    theme: united
    +    toc: true
    +    toc_depth: 4
    +    toc_float: true
    +    highlight: textmate
    +---
    +
    +## Overview
    +
     +SparkR is an R package that provides a lightweight frontend to use Apache Spark from R. In Spark 2.0.0, SparkR provides a distributed data frame implementation that supports data processing operations like selection, filtering, and aggregation, as well as distributed machine learning using [MLlib](http://spark.apache.org/mllib/).
    +
    +## Getting Started
    +
     +We begin with an example running on the local machine and provide an overview of the use of SparkR: data ingestion, data processing, and machine learning.
    +
    +First, let's load and attach the package.
    +```{r, message=FALSE}
    +library(SparkR)
    +```
    +
     +`SparkSession` is the entry point into SparkR, connecting your R program to a Spark cluster. You can create a `SparkSession` using `sparkR.session` and pass in options such as the application name, any Spark packages the application depends on, etc.
    +
     +We use the default settings, under which SparkR runs in local mode. If no existing Spark installation is found, the Spark package is automatically downloaded in the background. For more details about setup, see [Spark Session](#SetupSparkSession).
    +
    +```{r, message=FALSE, warning=FALSE}
    +sparkR.session()
    +```
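     +
     +Options can also be passed in explicitly. As an illustrative sketch (`appName` and `sparkConfig` are arguments of `sparkR.session`; the memory value here is only an example):
     +
     +```{r, eval=FALSE}
     +sparkR.session(appName = "SparkR-vignette",
     +               sparkConfig = list(spark.driver.memory = "2g"))
     +```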
    +
     +The operations in SparkR are centered around an R class called `SparkDataFrame`. It is a distributed collection of data organized into named columns, which is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood.
    +
     +A `SparkDataFrame` can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing local R data frames. For example, we can create a `SparkDataFrame` from a local R data frame:
    +
    +```{r}
    +cars <- cbind(model = rownames(mtcars), mtcars)
    +carsDF <- createDataFrame(cars)
    +```
    +
     +We can view the first few rows of the `SparkDataFrame` using the `showDF` or `head` function.
    +```{r}
    +showDF(carsDF)
    +```
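     +
     +Alternatively, `head` returns the first rows as a local R data frame (a small illustration using the same `carsDF`):
     +
     +```{r}
     +head(carsDF)
     +```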
    +
     +Common data processing operations such as `filter` and `select` are supported on the `SparkDataFrame`.
    +```{r}
    +carsSubDF <- select(carsDF, "model", "mpg", "hp")
    +carsSubDF <- filter(carsSubDF, carsSubDF$hp >= 200)
    +showDF(carsSubDF)
    +```
    +
    +SparkR can use many common aggregation functions after grouping.
    +
    +```{r}
    +carsGPDF <- summarize(groupBy(carsDF, carsDF$gear), count = n(carsDF$gear))
    +showDF(carsGPDF)
    +```
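     +
     +Other aggregate functions work the same way. As a sketch (`avg` is one of the aggregate functions exported by SparkR; the column name `avgMPG` is chosen here for illustration):
     +
     +```{r}
     +carsAvgDF <- summarize(groupBy(carsDF, carsDF$cyl), avgMPG = avg(carsDF$mpg))
     +showDF(carsAvgDF)
     +```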
    +
     +The results `carsDF`, `carsSubDF`, and `carsGPDF` are `SparkDataFrame` objects. To convert them back to an R `data.frame`, we can use `collect`.
    +```{r}
    +carsGP <- collect(carsGPDF)
    +class(carsGP)
    +```
    +
     +SparkR supports a number of commonly used machine learning algorithms. Under the hood, SparkR uses MLlib to train the model. Users can call `summary` to print a summary of the fitted model, `predict` to make predictions on new data, and `write.ml`/`read.ml` to save/load fitted models.
    +
     +SparkR supports a subset of R formula operators for model fitting, including `~`, `.`, `:`, `+`, and `-`. We use linear regression as an example.
    +```{r}
    +model <- spark.glm(carsDF, mpg ~ wt + cyl)
    --- End diff --
    
    Did you mean, say, `model <- spark.glm(data = carsDF, formula = mpg ~ wt + cyl)`, or something else?




