spark-commits mailing list archives

From: felixcheung@apache.org
Subject: spark git commit: [SPARKR][DOC] fix typo in vignettes
Date: Mon, 08 May 2017 06:16:47 GMT
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 6c5b7e106 -> d8a5a0d34


[SPARKR][DOC] fix typo in vignettes

## What changes were proposed in this pull request?
Fix typo in vignettes

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17884 from actuaryzhang/typo.

(cherry picked from commit 2fdaeb52bbe2ed1a9127ac72917286e505303c85)
Signed-off-by: Felix Cheung <felixcheung@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8a5a0d3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d8a5a0d3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d8a5a0d3

Branch: refs/heads/branch-2.2
Commit: d8a5a0d3420abbb911d8a80dc7165762eb08d779
Parents: 6c5b7e1
Author: Wayne Zhang <actuaryzhang@uber.com>
Authored: Sun May 7 23:16:30 2017 -0700
Committer: Felix Cheung <felixcheung@apache.org>
Committed: Sun May 7 23:16:44 2017 -0700

----------------------------------------------------------------------
 R/pkg/vignettes/sparkr-vignettes.Rmd | 36 +++++++++++++++----------------
 1 file changed, 18 insertions(+), 18 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/d8a5a0d3/R/pkg/vignettes/sparkr-vignettes.Rmd
----------------------------------------------------------------------
diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index b933c59..0f6d5c2 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -65,7 +65,7 @@ We can view the first few rows of the `SparkDataFrame` by `head` or `showDF` fun
 head(carsDF)
 ```
 
-Common data processing operations such as `filter`, `select` are supported on the `SparkDataFrame`.
+Common data processing operations such as `filter` and `select` are supported on the `SparkDataFrame`.
 ```{r}
 carsSubDF <- select(carsDF, "model", "mpg", "hp")
 carsSubDF <- filter(carsSubDF, carsSubDF$hp >= 200)
@@ -364,7 +364,7 @@ out <- dapply(carsSubDF, function(x) { x <- cbind(x, x$mpg * 1.61) }, schema)
 head(collect(out))
 ```
 
-Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of function should be a `data.frame`, but no schema is required in this case. Note that `dapplyCollect` can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.
+Like `dapply`, `dapplyCollect` can apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of the function should be a `data.frame`, but no schema is required in this case. Note that `dapplyCollect` can fail if the output of the UDF on all partitions cannot be pulled into the driver's memory.
 
 ```{r}
 out <- dapplyCollect(
@@ -390,7 +390,7 @@ result <- gapply(
 head(arrange(result, "max_mpg", decreasing = TRUE))
 ```
 
-Like gapply, `gapplyCollect` applies a function to each partition of a `SparkDataFrame` and collect the result back to R `data.frame`. The output of the function should be a `data.frame` but no schema is required in this case. Note that `gapplyCollect` can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.
+Like `gapply`, `gapplyCollect` can apply a function to each partition of a `SparkDataFrame` and collect the result back to R `data.frame`. The output of the function should be a `data.frame` but no schema is required in this case. Note that `gapplyCollect` can fail if the output of the UDF on all partitions cannot be pulled into the driver's memory.
 
 ```{r}
 result <- gapplyCollect(
@@ -443,20 +443,20 @@ options(ops)
 
 
 ### SQL Queries
-A `SparkDataFrame` can also be registered as a temporary view in Spark SQL and that allows you to run SQL queries over its data. The sql function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
+A `SparkDataFrame` can also be registered as a temporary view in Spark SQL so that one can run SQL queries over its data. The sql function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
 
 ```{r}
 people <- read.df(paste0(sparkR.conf("spark.home"),
                          "/examples/src/main/resources/people.json"), "json")
 ```
 
-Register this SparkDataFrame as a temporary view.
+Register this `SparkDataFrame` as a temporary view.
 
 ```{r}
 createOrReplaceTempView(people, "people")
 ```
 
-SQL statements can be run by using the sql method.
+SQL statements can be run using the sql method.
 ```{r}
 teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 head(teenagers)
@@ -765,7 +765,7 @@ head(predict(isoregModel, newDF))
 `spark.gbt` fits a [gradient-boosted tree](https://en.wikipedia.org/wiki/Gradient_boosting) classification or regression model on a `SparkDataFrame`.
 Users can call `summary` to get a summary of the fitted model, `predict` to make predictions, and `write.ml`/`read.ml` to save/load fitted models.
 
-Similar to the random forest example above, we use the `longley` dataset to train a gradient-boosted tree and make predictions:
+We use the `longley` dataset to train a gradient-boosted tree and make predictions:
 
 ```{r, warning=FALSE}
 df <- createDataFrame(longley)
@@ -805,7 +805,7 @@ head(select(fitted, "Class", "prediction"))
 
 `spark.gaussianMixture` fits multivariate [Gaussian Mixture Model](https://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model) (GMM) against a `SparkDataFrame`. [Expectation-Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) (EM) is used to approximate the maximum likelihood estimator (MLE) of the model.
 
-We use a simulated example to demostrate the usage.
+We use a simulated example to demonstrate the usage.
 ```{r}
 X1 <- data.frame(V1 = rnorm(4), V2 = rnorm(4))
 X2 <- data.frame(V1 = rnorm(6, 3), V2 = rnorm(6, 4))
@@ -836,9 +836,9 @@ head(select(kmeansPredictions, "model", "mpg", "hp", "wt", "prediction"), n = 20
 
 * Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (bag of words).
 
-* Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
+* Rather than clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
 
-To use LDA, we need to specify a `features` column in `data` where each entry represents a document. There are two type options for the column:
+To use LDA, we need to specify a `features` column in `data` where each entry represents a document. There are two options for the column:
 
 * character string: This can be a string of the whole document. It will be parsed automatically. Additional stop words can be added in `customizedStopWords`.
 
@@ -886,7 +886,7 @@ perplexity
 
 `spark.als` learns latent factors in [collaborative filtering](https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) via [alternating least squares](http://dl.acm.org/citation.cfm?id=1608614).
 
-There are multiple options that can be configured in `spark.als`, including `rank`, `reg`, `nonnegative`. For a complete list, refer to the help file.
+There are multiple options that can be configured in `spark.als`, including `rank`, `reg`, and `nonnegative`. For a complete list, refer to the help file.
 
 ```{r, eval=FALSE}
 ratings <- list(list(0, 0, 4.0), list(0, 1, 2.0), list(1, 1, 3.0), list(1, 2, 4.0),
@@ -966,7 +966,7 @@ testSummary
 
 
 ### Model Persistence
-The following example shows how to save/load an ML model by SparkR.
+The following example shows how to save/load an ML model in SparkR.
 ```{r}
 t <- as.data.frame(Titanic)
 training <- createDataFrame(t)
@@ -1064,19 +1064,19 @@ There are three main object classes in SparkR you may be working with.
     + `sdf` stores a reference to the corresponding Spark Dataset in the Spark JVM backend.
     + `env` saves the meta-information of the object such as `isCached`.
 
-It can be created by data import methods or by transforming an existing `SparkDataFrame`. We can manipulate `SparkDataFrame` by numerous data processing functions and feed that into machine learning algorithms.
+    It can be created by data import methods or by transforming an existing `SparkDataFrame`. We can manipulate `SparkDataFrame` by numerous data processing functions and feed that into machine learning algorithms.
 
-* `Column`: an S4 class representing column of `SparkDataFrame`. The slot `jc` saves a reference to the corresponding Column object in the Spark JVM backend.
+* `Column`: an S4 class representing a column of `SparkDataFrame`. The slot `jc` saves a reference to the corresponding `Column` object in the Spark JVM backend.
 
-It can be obtained from a `SparkDataFrame` by `$` operator, `df$col`. More often, it is used together with other functions, for example, with `select` to select particular columns, with `filter` and constructed conditions to select rows, with aggregation functions to compute aggregate statistics for each group.
+    It can be obtained from a `SparkDataFrame` by `$` operator, e.g., `df$col`. More often, it is used together with other functions, for example, with `select` to select particular columns, with `filter` and constructed conditions to select rows, with aggregation functions to compute aggregate statistics for each group.
 
-* `GroupedData`: an S4 class representing grouped data created by `groupBy` or by transforming other `GroupedData`. Its `sgd` slot saves a reference to a RelationalGroupedDataset object in the backend.
+* `GroupedData`: an S4 class representing grouped data created by `groupBy` or by transforming other `GroupedData`. Its `sgd` slot saves a reference to a `RelationalGroupedDataset` object in the backend.
 
-This is often an intermediate object with group information and followed up by aggregation operations.
+    This is often an intermediate object with group information and followed up by aggregation operations.
 
 ### Architecture
 
-A complete description of architecture can be seen in reference, in particular the paper *SparkR: Scaling R Programs with Spark*.
+A complete description of architecture can be seen in the references, in particular the paper *SparkR: Scaling R Programs with Spark*.
 
 Under the hood of SparkR is Spark SQL engine. This avoids the overheads of running interpreted R code, and the optimized SQL execution engine in Spark uses structural information about data and computation flow to perform a bunch of optimizations to speed up the computation.
 
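Editor's note: the final hunk above describes the three main SparkR object classes (`SparkDataFrame`, `Column`, `GroupedData`) but, unlike the other hunks, shows no accompanying code. The sketch below is not part of this commit; it is a minimal illustration built only from SparkR functions that appear in the vignette or its public API (`sparkR.session`, `createDataFrame`, `select`, `filter`, `groupBy`, `agg`, `avg`), assuming SparkR is installed and a local Spark build is available.

```r
# Not part of the commit: a minimal sketch of the three SparkR object classes
# described in the last hunk, run against a local session.
library(SparkR)
sparkR.session(master = "local[1]")

carsDF <- createDataFrame(mtcars)            # SparkDataFrame (slots: sdf, env)

hpCol <- carsDF$hp                           # Column, obtained with the `$` operator
fastCars <- filter(carsDF, hpCol >= 200)     # Column used in a filter condition
head(select(fastCars, "mpg", "hp"))

byCyl <- groupBy(carsDF, "cyl")              # GroupedData, an intermediate object...
head(agg(byCyl, avg_mpg = avg(carsDF$mpg)))  # ...followed by an aggregation

sparkR.session.stop()
```

This mirrors the `carsDF` examples earlier in the vignette and is only meant to make the class descriptions concrete.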



