spark-commits mailing list archives

From jkbrad...@apache.org
Subject [1/2] spark git commit: [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide
Date Fri, 15 Jul 2016 20:38:28 GMT
Repository: spark
Updated Branches:
  refs/heads/master 71ad945bb -> 5ffd5d383


http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-guide.md
----------------------------------------------------------------------
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 17fd3e1..30112c7 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -1,32 +1,12 @@
 ---
 layout: global
-title: MLlib
-displayTitle: Machine Learning Library (MLlib) Guide
-description: MLlib machine learning library overview for Spark SPARK_VERSION_SHORT
+title: "MLlib: RDD-based API"
+displayTitle: "MLlib: RDD-based API"
 ---
 
-MLlib is Spark's machine learning (ML) library.
-Its goal is to make practical machine learning scalable and easy.
-It consists of common learning algorithms and utilities, including classification, regression,
-clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization
-primitives and higher-level pipeline APIs.
-
-It divides into two packages:
-
-* [`spark.mllib`](mllib-guide.html#data-types-algorithms-and-utilities) contains the original API
-  built on top of [RDDs](programming-guide.html#resilient-distributed-datasets-rdds).
-* [`spark.ml`](ml-guide.html) provides higher-level API
-  built on top of [DataFrames](sql-programming-guide.html#dataframes) for constructing ML pipelines.
-
-Using `spark.ml` is recommended because with DataFrames the API is more versatile and flexible.
-But we will keep supporting `spark.mllib` along with the development of `spark.ml`.
-Users should be comfortable using `spark.mllib` features and expect more features coming.
-Developers should contribute new algorithms to `spark.ml` if they fit the ML pipeline concept well,
-e.g., feature extractors and transformers.
-
-We list major functionality from both below, with links to detailed guides.
-
-# spark.mllib: data types, algorithms, and utilities
+This page documents sections of the MLlib guide for the RDD-based API (the `spark.mllib` package).
+Please see the [MLlib Main Guide](ml-guide.html) for the DataFrame-based API (the `spark.ml` package),
+which is now the primary API for MLlib.
 
 * [Data types](mllib-data-types.html)
 * [Basic statistics](mllib-statistics.html)
@@ -65,192 +45,3 @@ We list major functionality from both below, with links to detailed guides.
   * [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd)
   * [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs)
 
-# spark.ml: high-level APIs for ML pipelines
-
-* [Overview: estimators, transformers and pipelines](ml-guide.html)
-* [Extracting, transforming and selecting features](ml-features.html)
-* [Classification and regression](ml-classification-regression.html)
-* [Clustering](ml-clustering.html)
-* [Collaborative filtering](ml-collaborative-filtering.html)
-* [Advanced topics](ml-advanced.html)
-
-Some techniques are not available yet in spark.ml, most notably dimensionality reduction.
-Users can seamlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`.
-
-# Dependencies
-
-MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
-[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
-If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
-implementation will be used instead.
-
-Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
-proxies by default.
-To configure `netlib-java` / Breeze to use system optimised binaries, include
-`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
-project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
-platform's additional installation instructions.
-
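For sbt-based projects, a minimal illustrative sketch of the dependency described above; the coordinates come from the text, while the `pomOnly()` qualifier is an assumption based on the `all` artifact being published as a POM-only aggregator:

{% highlight scala %}
// build.sbt: pull in netlib-java's optimised native proxies
// (Maven coordinates from the text: com.github.fommil.netlib:all:1.1.2)
// pomOnly() assumed here because the "all" artifact is a POM aggregator, not a jar
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
{% endhighlight %}
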
-To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
-
-[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
-    watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
-
-# Migration guide
-
-MLlib is under active development.
-The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
-and the migration guide below will explain all changes between releases.
-
-## From 1.6 to 2.0
-
-### Breaking changes
-
-There were several breaking changes in Spark 2.0, which are outlined below.
-
-**Linear algebra classes for DataFrame-based APIs**
-
-Spark's linear algebra dependencies were moved to a new project, `mllib-local` 
-(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). 
-As part of this change, the linear algebra classes were copied to a new package, `spark.ml.linalg`.
-The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes, 
-leading to a few breaking changes, predominantly in various model classes 
-(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a full list).
-
-**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the previous package `spark.mllib.linalg`.
-
-_Converting vectors and matrices_
-
-While most pipeline components support backward compatibility for loading, 
-some existing `DataFrames` and pipelines in Spark versions prior to 2.0, that contain vector or matrix
-columns, may need to be migrated to the new `spark.ml` vector and matrix types. 
-Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to `spark.ml.linalg` types
-(and vice versa) can be found in `spark.mllib.util.MLUtils`.
-
-There are also utility methods available for converting single instances of 
-vectors and matrices. Use the `asML` method on a `mllib.linalg.Vector` / `mllib.linalg.Matrix`
-for converting to `ml.linalg` types, and 
-`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML` 
-for converting to `mllib.linalg` types.
-
-<div class="codetabs">
-<div data-lang="scala"  markdown="1">
-
-{% highlight scala %}
-import org.apache.spark.mllib.util.MLUtils
-
-// convert DataFrame columns
-val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
-val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
-// convert a single vector or matrix
-val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
-val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
-{% endhighlight %}
-
-Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail.
-</div>
-
-<div data-lang="java" markdown="1">
-
-{% highlight java %}
-import org.apache.spark.mllib.util.MLUtils;
-import org.apache.spark.sql.Dataset;
-import org.apache.spark.sql.Row;
-
-// convert DataFrame columns
-Dataset<Row> convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF);
-Dataset<Row> convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF);
-// convert a single vector or matrix
-org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML();
-org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();
-{% endhighlight %}
-
-Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail.
-</div>
-
-<div data-lang="python"  markdown="1">
-
-{% highlight python %}
-from pyspark.mllib.util import MLUtils
-
-# convert DataFrame columns
-convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
-convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
-# convert a single vector or matrix
-mlVec = mllibVec.asML()
-mlMat = mllibMat.asML()
-{% endhighlight %}
-
-Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further detail.
-</div>
-</div>
-
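The tabs above only show the `asML` / `convertVectorColumnsToML` direction. As an illustrative Scala sketch of the reverse conversion described in the text (back to `spark.mllib.linalg` types), reusing the variables from the Scala tab above:

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.util.MLUtils

// convert DataFrame columns back to the RDD-based spark.mllib.linalg types
val restoredVecDF = MLUtils.convertVectorColumnsFromML(convertedVecDF)
val restoredMatrixDF = MLUtils.convertMatrixColumnsFromML(convertedMatrixDF)
// convert a single vector or matrix back
val mllibVec2: org.apache.spark.mllib.linalg.Vector = Vectors.fromML(mlVec)
val mllibMat2: org.apache.spark.mllib.linalg.Matrix = Matrices.fromML(mlMat)
{% endhighlight %}
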
-**Deprecated methods removed**
-
-Several deprecated methods were removed in the `spark.mllib` and `spark.ml` packages:
-
-* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator`
-* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml`
-* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as `DeveloperApi`)
-* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these functions are available on `RDD`s directly, and were marked as `DeveloperApi`)
-* `defaultStategy` in `mllib.tree.configuration.Strategy`
-* `build` in `mllib.tree.Node`
-* libsvm loaders for multiclass and load/save labeledData methods in `mllib.util.MLUtils`
-
-A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810).
-
-### Deprecations and changes of behavior
-
-**Deprecations**
-
-Deprecations in the `spark.mllib` and `spark.ml` packages include:
-
-* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
- In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
-* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
- In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
- the `numTrees` parameter has been deprecated in favor of `getNumTrees` method.
-* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
- In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
- We move all functionality in overridden methods to the corresponding `transformSchema`.
-* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
- In the `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
- We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression` instead (see the sketch after this list).
-* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
- In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
-* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
- In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`.
-* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`.
-
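As an illustrative sketch of the SPARK-14829 item above, this is roughly what the recommended DataFrame-based replacement looks like; the `training` DataFrame with "label" and "features" columns is an assumed input:

{% highlight scala %}
import org.apache.spark.ml.regression.LinearRegression

// `training` is assumed to be a DataFrame with "label" and "features" columns
val lr = new LinearRegression()
  .setMaxIter(100)
  .setRegParam(0.01)
  .setElasticNetParam(0.0)  // 0.0 ~ L2 penalty, 1.0 ~ L1 penalty

val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
{% endhighlight %}
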
-**Changes of behavior**
-
-Changes of behavior in the `spark.mllib` and `spark.ml` packages include:
-
-* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
- `spark.mllib.classification.LogisticRegressionWithLBFGS` now directly calls `spark.ml.classification.LogisticRegression` for binary classification.
- This introduces the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
-    * The intercept will not be regularized when training a binary classification model with the L1/L2 Updater.
-    * If no regularization is used, training with or without feature scaling will return the same solution at the same convergence rate.
-* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
- In order to provide better and consistent results with `spark.ml.classification.LogisticRegression`,
- the default value of `convergenceTol` in `spark.mllib.classification.LogisticRegressionWithLBFGS` has been changed from 1E-4 to 1E-6.
-* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
- Fixed a bug in `PowerIterationClustering` that will likely change its results.
-* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
- `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
-* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
- `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
-* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
- `HashingTF` uses `MurmurHash3` as default hash algorithm in both `spark.ml` and `spark.mllib`.
-* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
- The `expectedType` argument for PySpark `Param` was removed.
-* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
- Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
-* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
- `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (it previously used custom sampling logic).
- The output buckets will differ for the same input data and params.
-
-## Previous Spark versions
-
-Earlier migration guides are archived [on this page](mllib-migration-guides.html).
-
----

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-isotonic-regression.md
----------------------------------------------------------------------
diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md
index 8ede440..d90905a 100644
--- a/docs/mllib-isotonic-regression.md
+++ b/docs/mllib-isotonic-regression.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Isotonic regression - spark.mllib
-displayTitle: Regression - spark.mllib
+title: Isotonic regression - RDD-based API
+displayTitle: Regression - RDD-based API
 ---
 
 ## Isotonic regression

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-linear-methods.md
----------------------------------------------------------------------
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 17d781a..6fcd3ae 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Linear Methods - spark.mllib
-displayTitle: Linear Methods - spark.mllib
+title: Linear Methods - RDD-based API
+displayTitle: Linear Methods - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-migration-guides.md
----------------------------------------------------------------------
diff --git a/docs/mllib-migration-guides.md b/docs/mllib-migration-guides.md
index 970c669..ea6f93f 100644
--- a/docs/mllib-migration-guides.md
+++ b/docs/mllib-migration-guides.md
@@ -1,159 +1,9 @@
 ---
 layout: global
-title: Old Migration Guides - spark.mllib
-displayTitle: Old Migration Guides - spark.mllib
-description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
+title: Old Migration Guides - MLlib
+displayTitle: Old Migration Guides - MLlib
 ---
 
-The migration guide for the current Spark version is kept on the [MLlib Programming Guide main page](mllib-guide.html#migration-guide).
-
-## From 1.5 to 1.6
-
-There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
-deprecations and changes of behavior.
-
-Deprecations:
-
-* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
- In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
-* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
- In `spark.ml.classification.LogisticRegressionModel` and
- `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
- the new name `coefficients`.  This helps disambiguate from instance (row) "weights" given to
- algorithms.
-
-Changes of behavior:
-
-* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
- `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
- Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
- `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
- previous error); for small errors (`< 0.01`), it uses absolute error.
-* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
- `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
- tokenizing. Now, it converts to lowercase by default, with an option not to (see the sketch below). This matches the
- behavior of the simpler `Tokenizer` transformer.
-
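An illustrative sketch of the SPARK-11069 item above, showing how to opt out of the new default lowercasing; the column names and the input `sentenceDF` are assumptions:

{% highlight scala %}
import org.apache.spark.ml.feature.RegexTokenizer

// Since 1.6, RegexTokenizer lowercases its output by default; disable it to keep the old behavior.
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\W")        // split on non-word characters
  .setToLowercase(false)    // pre-1.6 behavior: preserve case

val tokenized = tokenizer.transform(sentenceDF)  // sentenceDF assumed to have a "text" column
{% endhighlight %}
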
-## From 1.4 to 1.5
-
-In the `spark.mllib` package, there are no breaking API changes but several behavior changes:
-
-* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
-  `RegressionMetrics.explainedVariance` returns the average regression sum of squares.
-* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` become
-  sorted.
-* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
-  convergence tolerance `1e-3`, and hence iterations might end earlier than 1.4.
-
-In the `spark.ml` package, there exists one breaking API change and one behavior change:
-
-* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
-  from `Params.setDefault` due to a
-  [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
-* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
-  added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
-
-## From 1.3 to 1.4
-
-In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
-
-* Gradient-Boosted Trees
-    * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed.  This is only an issue for users who wrote their own losses for GBTs.
-    * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields.  This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
-* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed.  It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`.  The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
-
-In the `spark.ml` package, several major API changes occurred, including:
-
-* `Param` and other APIs for specifying parameters
-* `uid` unique IDs for Pipeline components
-* Reorganization of certain classes
-
-Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all changes here.
-However, since 1.4 `spark.ml` is no longer an alpha component, we will provide details on any API
-changes for future releases.
-
-## From 1.2 to 1.3
-
-In the `spark.mllib` package, there were several breaking changes.  The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
-
-* *(Breaking change)* In [`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the extraneous method `solveLeastSquares` has been removed.  The `DeveloperApi` method `analyzeBlocks` was also removed.
-* *(Breaking change)* [`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method.  To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
-* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD) remains an Experimental component.  In it, there were two changes:
-    * The constructor taking arguments was removed in favor of a builder pattern using the default constructor plus parameter setter methods.
-    * Variable `model` is no longer public.
-* *(Breaking change)* [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) remains an Experimental component.  In it and its associated classes, there were several changes:
-    * In `DecisionTree`, the deprecated class method `train` has been removed.  (The object/static `train` methods remain.)
-    * In `Strategy`, the `checkpointDir` parameter has been removed.  Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training.
-* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`.  This was never meant for external use.
-* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
-  So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2.
-
-In the `spark.ml` package, the main API changes are from Spark SQL.  We list the most important changes here:
-
-* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API.  All algorithms in Spark ML which used to use SchemaRDD now use DataFrame.
-* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`.  These implicits have been moved, so we now call `import sqlContext.implicits._`.
-* Java APIs for SQL have also changed accordingly.  Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
-
-Other changes were in `LogisticRegression`:
-
-* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability").  The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
-* In Spark 1.2, `LogisticRegressionModel` did not include an intercept.  In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS).  The option to use an intercept will be added in the future.
-
-## From 1.1 to 1.2
-
-The only API changes in MLlib v1.2 are in
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
-which continues to be an experimental API in MLlib 1.2:
-
-1. *(Breaking change)* The Scala API for classification takes a named argument specifying the number
-of classes.  In MLlib v1.1, this argument was called `numClasses` in Python and
-`numClassesForClassification` in Scala.  In MLlib v1.2, the names are both set to `numClasses`.
-This `numClasses` parameter is specified either via
-[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
-or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
-static `trainClassifier` and `trainRegressor` methods.
-
-2. *(Breaking change)* The API for
-[`Node`](api/scala/index.html#org.apache.spark.mllib.tree.model.Node) has changed.
-This should generally not affect user code, unless the user manually constructs decision trees
-(instead of using the `trainClassifier` or `trainRegressor` methods).
-The tree `Node` now includes more information, including the probability of the predicted label
-(for classification).
-
-3. Printing methods' output has changed.  The `toString` (Scala/Java) and `__repr__` (Python) methods used to print the full model; they now print a summary.  For the full model, use `toDebugString`.
-
-Examples in the Spark distribution and examples in the
-[Decision Trees Guide](mllib-decision-tree.html#examples) have been updated accordingly.
-
-## From 1.0 to 1.1
-
-The only API changes in MLlib v1.1 are in
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
-which continues to be an experimental API in MLlib 1.1:
-
-1. *(Breaking change)* The meaning of tree depth has been changed by 1 in order to match
-the implementations of trees in
-[scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
-and in [rpart](http://cran.r-project.org/web/packages/rpart/index.html).
-In MLlib v1.0, a depth-1 tree had 1 leaf node, and a depth-2 tree had 1 root node and 2 leaf nodes.
-In MLlib v1.1, a depth-0 tree has 1 leaf node, and a depth-1 tree has 1 root node and 2 leaf nodes.
-This depth is specified by the `maxDepth` parameter in
-[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
-or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
-static `trainClassifier` and `trainRegressor` methods.
-
-2. *(Non-breaking change)* We recommend using the newly added `trainClassifier` and `trainRegressor`
-methods to build a [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
-rather than using the old parameter class `Strategy`.  These new training methods explicitly
-separate classification and regression, and they replace specialized parameter types with
-simple `String` types.
-
-Examples of the new, recommended `trainClassifier` and `trainRegressor` are given in the
-[Decision Trees Guide](mllib-decision-tree.html#examples).
-
-## From 0.9 to 1.0
-
-In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
-breaking changes.  If your data is sparse, please store it in a sparse format instead of dense to
-take advantage of sparsity in both storage and computation. Details are described below.
+The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
 
+Past migration guides are now stored at [ml-migration-guides.html](ml-migration-guides.html).

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-naive-bayes.md
----------------------------------------------------------------------
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index d0d594a..7471d18 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Naive Bayes - spark.mllib
-displayTitle: Naive Bayes - spark.mllib
+title: Naive Bayes - RDD-based API
+displayTitle: Naive Bayes - RDD-based API
 ---
 
 [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-optimization.md
----------------------------------------------------------------------
diff --git a/docs/mllib-optimization.md b/docs/mllib-optimization.md
index f90b66f..eefd7dc 100644
--- a/docs/mllib-optimization.md
+++ b/docs/mllib-optimization.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Optimization - spark.mllib
-displayTitle: Optimization - spark.mllib
+title: Optimization - RDD-based API
+displayTitle: Optimization - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-pmml-model-export.md
----------------------------------------------------------------------
diff --git a/docs/mllib-pmml-model-export.md b/docs/mllib-pmml-model-export.md
index 7f2347d..d353090 100644
--- a/docs/mllib-pmml-model-export.md
+++ b/docs/mllib-pmml-model-export.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: PMML model export - spark.mllib
-displayTitle: PMML model export - spark.mllib
+title: PMML model export - RDD-based API
+displayTitle: PMML model export - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-statistics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index 329855e..12797bd 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Basic Statistics - spark.mllib
-displayTitle: Basic Statistics - spark.mllib
+title: Basic Statistics - RDD-based API
+displayTitle: Basic Statistics - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 2bc4912..888c12f 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1571,7 +1571,7 @@ have changed from returning (key, list of values) pairs to (key, iterable of val
 </div>
 
 Migration guides are also available for [Spark Streaming](streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x),
-[MLlib](mllib-guide.html#migration-guide) and [GraphX](graphx-programming-guide.html#migrating-from-spark-091).
+[MLlib](ml-guide.html#migration-guide) and [GraphX](graphx-programming-guide.html#migrating-from-spark-091).
 
 
 # Where to Go from Here

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 2ee3b80..de82a06 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -15,7 +15,7 @@ like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex
 algorithms expressed with high-level functions like `map`, `reduce`, `join` and `window`.
 Finally, processed data can be pushed out to filesystems, databases,
 and live dashboards. In fact, you can apply Spark's
-[machine learning](mllib-guide.html) and
+[machine learning](ml-guide.html) and
 [graph processing](graphx-programming-guide.html) algorithms on data streams.
 
 <p style="text-align: center;">
@@ -1673,7 +1673,7 @@ See the [DataFrames and SQL](sql-programming-guide.html) guide to learn more abo
 ***
 
 ## MLlib Operations
-You can also easily use machine learning algorithms provided by [MLlib](mllib-guide.html). First of all, there are streaming machine learning algorithms (e.g. [Streaming Linear Regression](mllib-linear-methods.html#streaming-linear-regression), [Streaming KMeans](mllib-clustering.html#streaming-k-means), etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the [MLlib](mllib-guide.html) guide for more details.
+You can also easily use machine learning algorithms provided by [MLlib](ml-guide.html). First of all, there are streaming machine learning algorithms (e.g. [Streaming Linear Regression](mllib-linear-methods.html#streaming-linear-regression), [Streaming KMeans](mllib-clustering.html#streaming-k-means), etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the [MLlib](ml-guide.html) guide for more details.
 
 ***
 

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/python/pyspark/ml/__init__.py
----------------------------------------------------------------------
diff --git a/python/pyspark/ml/__init__.py b/python/pyspark/ml/__init__.py
index 05f3be5..1d42d49 100644
--- a/python/pyspark/ml/__init__.py
+++ b/python/pyspark/ml/__init__.py
@@ -16,8 +16,8 @@
 #
 
 """
-Spark ML is a component that adds a new set of machine learning APIs to let users quickly
-assemble and configure practical machine learning pipelines.
+DataFrame-based machine learning APIs to let users quickly assemble and configure practical
+machine learning pipelines.
 """
 from pyspark.ml.base import Estimator, Model, Transformer
 from pyspark.ml.pipeline import Pipeline, PipelineModel

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/python/pyspark/ml/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index 24efce8..4bcb2c4 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -16,7 +16,7 @@
 #
 
 """
-Unit tests for Spark ML Python APIs.
+Unit tests for MLlib Python DataFrame-based APIs.
 """
 import sys
 if sys.version > '3':

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/python/pyspark/mllib/__init__.py
----------------------------------------------------------------------
diff --git a/python/pyspark/mllib/__init__.py b/python/pyspark/mllib/__init__.py
index acba3a7..ae26521 100644
--- a/python/pyspark/mllib/__init__.py
+++ b/python/pyspark/mllib/__init__.py
@@ -16,7 +16,10 @@
 #
 
 """
-Python bindings for MLlib.
+RDD-based machine learning APIs for Python (in maintenance mode).
+
+The `pyspark.mllib` package is in maintenance mode as of the Spark 2.0.0 release to encourage
+migration to the DataFrame-based APIs under the `pyspark.ml` package.
 """
 from __future__ import absolute_import
 

