spark-reviews mailing list archives

From mengxr <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-10348] [MLLIB] updates ml-guide
Date Sun, 30 Aug 2015 06:30:01 GMT
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8517#discussion_r38269010
  
    --- Diff: docs/ml-guide.md ---
    @@ -24,61 +24,74 @@ title: Spark ML Programming Guide
     The `spark.ml` package aims to provide a uniform set of high-level APIs built on top of
     [DataFrames](sql-programming-guide.html#dataframes) that help users create and tune practical
     machine learning pipelines.
    -See the [Algorithm Guides section](#algorithm-guides) below for guides on sub-packages of
    +See the [algorithm guides](#algorithm-guides) section below for guides on sub-packages of
     `spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.
     
    -**Table of Contents**
    +**Table of contents**
     
     * This will become a table of contents (this text will be scraped).
     {:toc}
     
    -# Main Concepts
    +# Main concepts
     
    -Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow.  This section covers the key concepts introduced by the Spark ML API.
    +Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple
    +algorithms into a single pipeline, or workflow.
    +This section covers the key concepts introduced by the Spark ML API, where the pipeline concept is
    +mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
     
    -* **[ML Dataset](ml-guide.html#ml-dataset)**: Spark ML uses the [`DataFrame`](api/scala/index.html#org.apache.spark.sql.DataFrame) from Spark SQL as a dataset which can hold a variety of data types.
    -E.g., a dataset could have different columns storing text, feature vectors, true labels, and predictions.
    +* **[`DataFrame`](ml-guide.html#dataframe)**: Spark ML uses `DataFrame` from Spark SQL as an ML
    +  dataset, which can hold a variety of data types.
    +  E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
     
     * **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
    -E.g., an ML model is a `Transformer` which transforms an RDD with features into an RDD with predictions.
    +E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.
     
     * **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
    -E.g., a learning algorithm is an `Estimator` which trains on a dataset and produces a model.
    +E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
     
     * **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
     
    -* **[`Param`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
    +* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
     
    -## ML Dataset
    +## DataFrame
     
     Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
    -Spark ML adopts the [`DataFrame`](api/scala/index.html#org.apache.spark.sql.DataFrame) from Spark SQL in order to support a variety of data types under a unified Dataset concept.
    +Spark ML adopts the `DataFrame` from Spark SQL in order to support a variety of data types.
    --- End diff --
    
    I thought about this but didn't figure out a good solution. Using `spark.ml` everywhere is accurate, but it makes the guide a little strange to read. Another option is to define `Spark ML` precisely somewhere in the doc. Let me think about this and make a new PR if necessary.
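For readers skimming this review, the concepts the diff describes can be sketched without Spark at all. The following is a minimal, framework-free illustration in plain Python of the contract the guide names: an `Estimator.fit` returns a `Transformer`, and a `Pipeline` chains both kinds of stages. All class names here (`MeanEstimator`, `MeanModel`, etc.) are hypothetical stand-ins, and a "DataFrame" is mocked as a list of dicts; this is not the actual `spark.ml` API.

```python
# Sketch of the spark.ml abstractions described in the diff above.
# NOT the real Spark API: a "DataFrame" is just a list of dicts here,
# and the concrete stage classes are made up for illustration.

class Transformer:
    def transform(self, df):  # DataFrame -> DataFrame
        raise NotImplementedError

class Estimator:
    def fit(self, df):        # DataFrame -> Transformer
        raise NotImplementedError

class MeanModel(Transformer):
    """A fitted model is itself a Transformer: it appends a prediction column."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, df):
        return [dict(row, prediction=self.mean) for row in df]

class MeanEstimator(Estimator):
    """Training on a dataset produces a MeanModel (a Transformer)."""
    def fit(self, df):
        labels = [row["label"] for row in df]
        return MeanModel(sum(labels) / len(labels))

class PipelineModel(Transformer):
    """The result of fitting a Pipeline: applies each fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, df):
        for stage in self.stages:
            df = stage.transform(df)
        return df

class Pipeline(Estimator):
    """Chains Transformers and Estimators; fit() yields a PipelineModel."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, df):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(df)  # replace the Estimator by its fitted Transformer
            df = stage.transform(df)   # downstream stages see the transformed data
            fitted.append(stage)
        return PipelineModel(fitted)

train = [{"label": 1.0}, {"label": 3.0}]
model = Pipeline([MeanEstimator()]).fit(train)       # Estimator -> Transformer
print(model.transform([{"label": 2.0}]))              # adds a "prediction" column
```

The point of the sketch is only the type signatures: `fit` maps a dataset to a `Transformer`, `transform` maps a dataset to a dataset, and a fitted pipeline is itself a `Transformer`.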



