spark-commits mailing list archives

From yh...@apache.org
Subject [18/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
Date Wed, 28 Dec 2016 22:35:29 GMT
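For context, the hunks below re-highlight the `CrossValidator` examples in `ml-tuning.html`. What `CrossValidator` does conceptually — evaluate every point of a parameter grid by average held-out score across k folds, then keep the best — can be sketched in plain Python with no Spark dependency (all names here are illustrative, not Spark's API):

```python
# Minimal sketch of k-fold cross-validation over a parameter grid,
# mirroring what Spark ML's CrossValidator does (illustrative only).
from itertools import product

def k_fold_grid_search(data, labels, param_grid, fit, score, num_folds=3):
    """Return (best_params, best_avg_score) across all grid points."""
    # Assign every k-th example to the same fold (a simple deterministic split).
    folds = [list(range(i, len(data), num_folds)) for i in range(num_folds)]
    best_params, best_score = None, float("-inf")
    keys = list(param_grid)
    # Cartesian product of grid values, like ParamGridBuilder.build().
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        fold_scores = []
        for held_out in folds:
            train_idx = [i for i in range(len(data)) if i not in held_out]
            model = fit([data[i] for i in train_idx],
                        [labels[i] for i in train_idx], params)
            fold_scores.append(score(model,
                                     [data[i] for i in held_out],
                                     [labels[i] for i in held_out]))
        avg = sum(fold_scores) / len(fold_scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score

# Toy demo: pick the best threshold t for classifying "x >= t".
data, labels = list(range(10)), [0] * 5 + [1] * 5
fit = lambda X, y, p: p["t"]  # "training" just reads the parameter
score = lambda t, X, y: sum((x >= t) == bool(lab) for x, lab in zip(X, y)) / len(X)
best, avg = k_fold_grid_search(data, labels, {"t": [3, 5, 8]}, fit, score, num_folds=2)
# best == {"t": 5}; avg == 1.0
```

Spark's real `CrossValidator` additionally refits the best model on the full dataset and distributes the fold evaluations; the sketch above only captures the selection logic.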
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-tuning.html
----------------------------------------------------------------------
diff --git a/site/docs/2.1.0/ml-tuning.html b/site/docs/2.1.0/ml-tuning.html
index 0c36a98..2246cc2 100644
--- a/site/docs/2.1.0/ml-tuning.html
+++ b/site/docs/2.1.0/ml-tuning.html
@@ -329,13 +329,13 @@ Built-in Cross-Validation and other tooling allow users to optimize hyperparamet
 <p><strong>Table of contents</strong></p>
 
 <ul id="markdown-toc">
-  <li><a href="#model-selection-aka-hyperparameter-tuning" id="markdown-toc-model-selection-aka-hyperparameter-tuning">Model selection (a.k.a. hyperparameter tuning)</a></li>
-  <li><a href="#cross-validation" id="markdown-toc-cross-validation">Cross-Validation</a>    <ul>
-      <li><a href="#example-model-selection-via-cross-validation" id="markdown-toc-example-model-selection-via-cross-validation">Example: model selection via cross-validation</a></li>
+  <li><a href="#model-selection-aka-hyperparameter-tuning">Model selection (a.k.a. hyperparameter tuning)</a></li>
+  <li><a href="#cross-validation">Cross-Validation</a>    <ul>
+      <li><a href="#example-model-selection-via-cross-validation">Example: model selection via cross-validation</a></li>
     </ul>
   </li>
-  <li><a href="#train-validation-split" id="markdown-toc-train-validation-split">Train-Validation Split</a>    <ul>
-      <li><a href="#example-model-selection-via-train-validation-split" id="markdown-toc-example-model-selection-via-train-validation-split">Example: model selection via train validation split</a></li>
+  <li><a href="#train-validation-split">Train-Validation Split</a>    <ul>
+      <li><a href="#example-model-selection-via-train-validation-split">Example: model selection via train validation split</a></li>
     </ul>
   </li>
 </ul>
@@ -396,7 +396,7 @@ However, it is also a well-established method for choosing parameters which is m
 
 Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) for details on the API.
 
-<div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.ml.Pipeline</span>
+<div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.ml.Pipeline</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.ml.classification.LogisticRegression</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.ml.evaluation.BinaryClassificationEvaluator</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.ml.feature.</span><span class="o">{</span><span class="nc">HashingTF</span><span class="o">,</span> <span class="nc">Tokenizer</span><span class="o">}</span>
@@ -467,7 +467,7 @@ Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark
   <span class="o">.</span><span class="n">select</span><span class="o">(</span><span class="s">&quot;id&quot;</span><span class="o">,</span> <span class="s">&quot;text&quot;</span><span class="o">,</span> <span class="s">&quot;probability&quot;</span><span class="o">,</span> <span class="s">&quot;prediction&quot;</span><span class="o">)</span>
   <span class="o">.</span><span class="n">collect</span><span class="o">()</span>
   <span class="o">.</span><span class="n">foreach</span> <span class="o">{</span> <span class="k">case</span> <span class="nc">Row</span><span class="o">(</span><span class="n">id</span><span class="k">:</span> <span class="kt">Long</span><span class="o">,</span> <span class="n">text</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">prob</span><span class="k">:</span> <span class="kt">Vector</span><span class="o">,</span> <span class="n">prediction</span><span class="k">:</span> <span class="kt">Double</span><span class="o">)</span> <span class="k">=&gt;</span>
-    <span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;($id, $text) --&gt; prob=$prob, prediction=$prediction&quot;</span><span class="o">)</span>
+    <span class="n">println</span><span class="o">(</span><span class="s">s&quot;(</span><span class="si">$id</span><span class="s">, </span><span class="si">$text</span><span class="s">) --&gt; prob=</span><span class="si">$prob</span><span class="s">, prediction=</span><span class="si">$prediction</span><span class="s">&quot;</span><span class="o">)</span>
   <span class="o">}</span>
 </pre></div><div><small>Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala" in the Spark repo.</small></div>
 </div>
@@ -476,7 +476,7 @@ Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark
 
 Refer to the [`CrossValidator` Java docs](api/java/org/apache/spark/ml/tuning/CrossValidator.html) for details on the API.
 
-<div class="highlight"><pre><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
+<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.util.Arrays</span><span class="o">;</span>
 
 <span class="kn">import</span> <span class="nn">org.apache.spark.ml.Pipeline</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.ml.PipelineStage</span><span class="o">;</span>
@@ -493,38 +493,38 @@ Refer to the [`CrossValidator` Java docs](api/java/org/apache/spark/ml/tuning/Cr
 
 <span class="c1">// Prepare training documents, which are labeled.</span>
 <span class="n">Dataset</span><span class="o">&lt;</span><span class="n">Row</span><span class="o">&gt;</span> <span class="n">training</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">createDataFrame</span><span class="o">(</span><span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">0L</span><span class="o">,</span> <span class="s">&quot;a b c d e spark&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">1L</span><span class="o">,</span> <span class="s">&quot;b d&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">2L</span><span class="o">,</span><span class="s">&quot;spark f g h&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">3L</span><span class="o">,</span> <span class="s">&quot;hadoop mapreduce&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">4L</span><span class="o">,</span> <span class="s">&quot;b spark who&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">5L</span><span class="o">,</span> <span class="s">&quot;g d a y&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">6L</span><span class="o">,</span> <span class="s">&quot;spark fly&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">7L</span><span class="o">,</span> <span class="s">&quot;was mapreduce&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">8L</span><span class="o">,</span> <span class="s">&quot;e spark program&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">9L</span><span class="o">,</span> <span class="s">&quot;a e c l&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">10L</span><span class="o">,</span> <span class="s">&quot;spark compile&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaLabeledDocument</span><span class="o">(</span><span class="mi">11L</span><span class="o">,</span> <span class="s">&quot;hadoop software&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">)</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">0</span><span class="n">L</span><span class="o">,</span> <span class="s">&quot;a b c d e spark&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">1L</span><span class="o">,</span> <span class="s">&quot;b d&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">2L</span><span class="o">,</span><span class="s">&quot;spark f g h&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">3L</span><span class="o">,</span> <span class="s">&quot;hadoop mapreduce&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">4L</span><span class="o">,</span> <span class="s">&quot;b spark who&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">5L</span><span class="o">,</span> <span class="s">&quot;g d a y&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">6L</span><span class="o">,</span> <span class="s">&quot;spark fly&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">7L</span><span class="o">,</span> <span class="s">&quot;was mapreduce&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">8L</span><span class="o">,</span> <span class="s">&quot;e spark program&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">9L</span><span class="o">,</span> <span class="s">&quot;a e c l&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">10L</span><span class="o">,</span> <span class="s">&quot;spark compile&quot;</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaLabeledDocument</span><span class="o">(</span><span class="mi">11L</span><span class="o">,</span> <span class="s">&quot;hadoop software&quot;</span><span class="o">,</span> <span class="mf">0.0</span><span class="o">)</span>
 <span class="o">),</span> <span class="n">JavaLabeledDocument</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
 
 <span class="c1">// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.</span>
-<span class="n">Tokenizer</span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">Tokenizer</span><span class="o">()</span>
+<span class="n">Tokenizer</span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Tokenizer</span><span class="o">()</span>
   <span class="o">.</span><span class="na">setInputCol</span><span class="o">(</span><span class="s">&quot;text&quot;</span><span class="o">)</span>
   <span class="o">.</span><span class="na">setOutputCol</span><span class="o">(</span><span class="s">&quot;words&quot;</span><span class="o">);</span>
-<span class="n">HashingTF</span> <span class="n">hashingTF</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">HashingTF</span><span class="o">()</span>
+<span class="n">HashingTF</span> <span class="n">hashingTF</span> <span class="o">=</span> <span class="k">new</span> <span class="n">HashingTF</span><span class="o">()</span>
   <span class="o">.</span><span class="na">setNumFeatures</span><span class="o">(</span><span class="mi">1000</span><span class="o">)</span>
   <span class="o">.</span><span class="na">setInputCol</span><span class="o">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="na">getOutputCol</span><span class="o">())</span>
   <span class="o">.</span><span class="na">setOutputCol</span><span class="o">(</span><span class="s">&quot;features&quot;</span><span class="o">);</span>
-<span class="n">LogisticRegression</span> <span class="n">lr</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">LogisticRegression</span><span class="o">()</span>
+<span class="n">LogisticRegression</span> <span class="n">lr</span> <span class="o">=</span> <span class="k">new</span> <span class="n">LogisticRegression</span><span class="o">()</span>
   <span class="o">.</span><span class="na">setMaxIter</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span>
   <span class="o">.</span><span class="na">setRegParam</span><span class="o">(</span><span class="mf">0.01</span><span class="o">);</span>
-<span class="n">Pipeline</span> <span class="n">pipeline</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">Pipeline</span><span class="o">()</span>
+<span class="n">Pipeline</span> <span class="n">pipeline</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Pipeline</span><span class="o">()</span>
   <span class="o">.</span><span class="na">setStages</span><span class="o">(</span><span class="k">new</span> <span class="n">PipelineStage</span><span class="o">[]</span> <span class="o">{</span><span class="n">tokenizer</span><span class="o">,</span> <span class="n">hashingTF</span><span class="o">,</span> <span class="n">lr</span><span class="o">});</span>
 
 <span class="c1">// We use a ParamGridBuilder to construct a grid of parameters to search over.</span>
 <span class="c1">// With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,</span>
 <span class="c1">// this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.</span>
-<span class="n">ParamMap</span><span class="o">[]</span> <span class="n">paramGrid</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">ParamGridBuilder</span><span class="o">()</span>
+<span class="n">ParamMap</span><span class="o">[]</span> <span class="n">paramGrid</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ParamGridBuilder</span><span class="o">()</span>
   <span class="o">.</span><span class="na">addGrid</span><span class="o">(</span><span class="n">hashingTF</span><span class="o">.</span><span class="na">numFeatures</span><span class="o">(),</span> <span class="k">new</span> <span class="kt">int</span><span class="o">[]</span> <span class="o">{</span><span class="mi">10</span><span class="o">,</span> <span class="mi">100</span><span class="o">,</span> <span class="mi">1000</span><span class="o">})</span>
   <span class="o">.</span><span class="na">addGrid</span><span class="o">(</span><span class="n">lr</span><span class="o">.</span><span class="na">regParam</span><span class="o">(),</span> <span class="k">new</span> <span class="kt">double</span><span class="o">[]</span> <span class="o">{</span><span class="mf">0.1</span><span class="o">,</span> <span class="mf">0.01</span><span class="o">})</span>
   <span class="o">.</span><span class="na">build</span><span class="o">();</span>
@@ -534,9 +534,9 @@ Refer to the [`CrossValidator` Java docs](api/java/org/apache/spark/ml/tuning/Cr
 <span class="c1">// A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.</span>
 <span class="c1">// Note that the evaluator here is a BinaryClassificationEvaluator and its default metric</span>
 <span class="c1">// is areaUnderROC.</span>
-<span class="n">CrossValidator</span> <span class="n">cv</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">CrossValidator</span><span class="o">()</span>
+<span class="n">CrossValidator</span> <span class="n">cv</span> <span class="o">=</span> <span class="k">new</span> <span class="n">CrossValidator</span><span class="o">()</span>
   <span class="o">.</span><span class="na">setEstimator</span><span class="o">(</span><span class="n">pipeline</span><span class="o">)</span>
-  <span class="o">.</span><span class="na">setEvaluator</span><span class="o">(</span><span class="k">new</span> <span class="nf">BinaryClassificationEvaluator</span><span class="o">())</span>
+  <span class="o">.</span><span class="na">setEvaluator</span><span class="o">(</span><span class="k">new</span> <span class="n">BinaryClassificationEvaluator</span><span class="o">())</span>
   <span class="o">.</span><span class="na">setEstimatorParamMaps</span><span class="o">(</span><span class="n">paramGrid</span><span class="o">).</span><span class="na">setNumFolds</span><span class="o">(</span><span class="mi">2</span><span class="o">);</span>  <span class="c1">// Use 3+ in practice</span>
 
 <span class="c1">// Run cross-validation, and choose the best set of parameters.</span>
@@ -544,10 +544,10 @@ Refer to the [`CrossValidator` Java docs](api/java/org/apache/spark/ml/tuning/Cr
 
 <span class="c1">// Prepare test documents, which are unlabeled.</span>
 <span class="n">Dataset</span><span class="o">&lt;</span><span class="n">Row</span><span class="o">&gt;</span> <span class="n">test</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="na">createDataFrame</span><span class="o">(</span><span class="n">Arrays</span><span class="o">.</span><span class="na">asList</span><span class="o">(</span>
-  <span class="k">new</span> <span class="nf">JavaDocument</span><span class="o">(</span><span class="mi">4L</span><span class="o">,</span> <span class="s">&quot;spark i j k&quot;</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaDocument</span><span class="o">(</span><span class="mi">5L</span><span class="o">,</span> <span class="s">&quot;l m n&quot;</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaDocument</span><span class="o">(</span><span class="mi">6L</span><span class="o">,</span> <span class="s">&quot;mapreduce spark&quot;</span><span class="o">),</span>
-  <span class="k">new</span> <span class="nf">JavaDocument</span><span class="o">(</span><span class="mi">7L</span><span class="o">,</span> <span class="s">&quot;apache hadoop&quot;</span><span class="o">)</span>
+  <span class="k">new</span> <span class="n">JavaDocument</span><span class="o">(</span><span class="mi">4L</span><span class="o">,</span> <span class="s">&quot;spark i j k&quot;</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaDocument</span><span class="o">(</span><span class="mi">5L</span><span class="o">,</span> <span class="s">&quot;l m n&quot;</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaDocument</span><span class="o">(</span><span class="mi">6L</span><span class="o">,</span> <span class="s">&quot;mapreduce spark&quot;</span><span class="o">),</span>
+  <span class="k">new</span> <span class="n">JavaDocument</span><span class="o">(</span><span class="mi">7L</span><span class="o">,</span> <span class="s">&quot;apache hadoop&quot;</span><span class="o">)</span>
 <span class="o">),</span> <span class="n">JavaDocument</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
 
 <span class="c1">// Make predictions on test documents. cvModel uses the best model found (lrModel).</span>
@@ -563,40 +563,40 @@ Refer to the [`CrossValidator` Java docs](api/java/org/apache/spark/ml/tuning/Cr
 
 Refer to the [`CrossValidator` Python docs](api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator) for more details on the API.
 
-<div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.ml</span> <span class="kn">import</span> <span class="n">Pipeline</span>
+<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.ml</span> <span class="kn">import</span> <span class="n">Pipeline</span>
 <span class="kn">from</span> <span class="nn">pyspark.ml.classification</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
 <span class="kn">from</span> <span class="nn">pyspark.ml.evaluation</span> <span class="kn">import</span> <span class="n">BinaryClassificationEvaluator</span>
 <span class="kn">from</span> <span class="nn">pyspark.ml.feature</span> <span class="kn">import</span> <span class="n">HashingTF</span><span class="p">,</span> <span class="n">Tokenizer</span>
 <span class="kn">from</span> <span class="nn">pyspark.ml.tuning</span> <span class="kn">import</span> <span class="n">CrossValidator</span><span class="p">,</span> <span class="n">ParamGridBuilder</span>
 
-<span class="c"># Prepare training documents, which are labeled.</span>
+<span class="c1"># Prepare training documents, which are labeled.</span>
 <span class="n">training</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([</span>
-    <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s">&quot;a b c d e spark&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">&quot;b d&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s">&quot;spark f g h&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s">&quot;hadoop mapreduce&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s">&quot;b spark who&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="s">&quot;g d a y&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="s">&quot;spark fly&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="s">&quot;was mapreduce&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="s">&quot;e spark program&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="s">&quot;a e c l&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="s">&quot;spark compile&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">11</span><span class="p">,</span> <span class="s">&quot;hadoop software&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">)</span>
-<span class="p">],</span> <span class="p">[</span><span class="s">&quot;id&quot;</span><span class="p">,</span> <span class="s">&quot;text&quot;</span><span class="p">,</span> <span class="s">&quot;label&quot;</span><span class="p">])</span>
-
-<span class="c"># Configure an ML pipeline, which consists of tree stages: tokenizer, hashingTF, and lr.</span>
-<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="n">inputCol</span><span class="o">=</span><span class="s">&quot;text&quot;</span><span class="p">,</span> <span class="n">outputCol</span><span class="o">=</span><span class="s">&quot;words&quot;</span><span class="p">)</span>
-<span class="n">hashingTF</span> <span class="o">=</span> <span class="n">HashingTF</span><span class="p">(</span><span class="n">inputCol</span><span class="o">=</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">getOutputCol</span><span class="p">(),</span> <span class="n">outputCol</span><span class="o">=</span><span class="s">&quot;features&quot;</span><span class="p">)</span>
+    <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s2">&quot;a b c d e spark&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s2">&quot;b d&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s2">&quot;spark f g h&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s2">&quot;hadoop mapreduce&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s2">&quot;b spark who&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="s2">&quot;g d a y&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="s2">&quot;spark fly&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="s2">&quot;was mapreduce&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="s2">&quot;e spark program&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="s2">&quot;a e c l&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="s2">&quot;spark compile&quot;</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">11</span><span class="p">,</span> <span class="s2">&quot;hadoop software&quot;</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">)</span>
+<span class="p">],</span> <span class="p">[</span><span class="s2">&quot;id&quot;</span><span class="p">,</span> <span class="s2">&quot;text&quot;</span><span class="p">,</span> <span class="s2">&quot;label&quot;</span><span class="p">])</span>
+
+<span class="c1"># Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.</span>
+<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="n">inputCol</span><span class="o">=</span><span class="s2">&quot;text&quot;</span><span class="p">,</span> <span class="n">outputCol</span><span class="o">=</span><span class="s2">&quot;words&quot;</span><span class="p">)</span>
+<span class="n">hashingTF</span> <span class="o">=</span> <span class="n">HashingTF</span><span class="p">(</span><span class="n">inputCol</span><span class="o">=</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">getOutputCol</span><span class="p">(),</span> <span class="n">outputCol</span><span class="o">=</span><span class="s2">&quot;features&quot;</span><span class="p">)</span>
 <span class="n">lr</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">maxIter</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
 <span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">stages</span><span class="o">=</span><span class="p">[</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">hashingTF</span><span class="p">,</span> <span class="n">lr</span><span class="p">])</span>
 
-<span class="c"># We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.</span>
-<span class="c"># This will allow us to jointly choose parameters for all Pipeline stages.</span>
-<span class="c"># A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.</span>
-<span class="c"># We use a ParamGridBuilder to construct a grid of parameters to search over.</span>
-<span class="c"># With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,</span>
-<span class="c"># this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.</span>
+<span class="c1"># We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.</span>
+<span class="c1"># This will allow us to jointly choose parameters for all Pipeline stages.</span>
+<span class="c1"># A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.</span>
+<span class="c1"># We use a ParamGridBuilder to construct a grid of parameters to search over.</span>
+<span class="c1"># With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,</span>
+<span class="c1"># this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.</span>
 <span class="n">paramGrid</span> <span class="o">=</span> <span class="n">ParamGridBuilder</span><span class="p">()</span> \
     <span class="o">.</span><span class="n">addGrid</span><span class="p">(</span><span class="n">hashingTF</span><span class="o">.</span><span class="n">numFeatures</span><span class="p">,</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">1000</span><span class="p">])</span> \
     <span class="o">.</span><span class="n">addGrid</span><span class="p">(</span><span class="n">lr</span><span class="o">.</span><span class="n">regParam</span><span class="p">,</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">])</span> \
@@ -605,22 +605,22 @@ Refer to the [`CrossValidator` Python docs](api/python/pyspark.ml.html#pyspark.m
 <span class="n">crossval</span> <span class="o">=</span> <span class="n">CrossValidator</span><span class="p">(</span><span class="n">estimator</span><span class="o">=</span><span class="n">pipeline</span><span class="p">,</span>
                           <span class="n">estimatorParamMaps</span><span class="o">=</span><span class="n">paramGrid</span><span class="p">,</span>
                           <span class="n">evaluator</span><span class="o">=</span><span class="n">BinaryClassificationEvaluator</span><span class="p">(),</span>
-                          <span class="n">numFolds</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>  <span class="c"># use 3+ folds in practice</span>
+                          <span class="n">numFolds</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>  <span class="c1"># use 3+ folds in practice</span>
 
-<span class="c"># Run cross-validation, and choose the best set of parameters.</span>
+<span class="c1"># Run cross-validation, and choose the best set of parameters.</span>
 <span class="n">cvModel</span> <span class="o">=</span> <span class="n">crossval</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span>
 
-<span class="c"># Prepare test documents, which are unlabeled.</span>
+<span class="c1"># Prepare test documents, which are unlabeled.</span>
 <span class="n">test</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([</span>
-    <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s">&quot;spark i j k&quot;</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="s">&quot;l m n&quot;</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="s">&quot;mapreduce spark&quot;</span><span class="p">),</span>
-    <span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="s">&quot;apache hadoop&quot;</span><span class="p">)</span>
-<span class="p">],</span> <span class="p">[</span><span class="s">&quot;id&quot;</span><span class="p">,</span> <span class="s">&quot;text&quot;</span><span class="p">])</span>
+    <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s2">&quot;spark i j k&quot;</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="s2">&quot;l m n&quot;</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="s2">&quot;mapreduce spark&quot;</span><span class="p">),</span>
+    <span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="s2">&quot;apache hadoop&quot;</span><span class="p">)</span>
+<span class="p">],</span> <span class="p">[</span><span class="s2">&quot;id&quot;</span><span class="p">,</span> <span class="s2">&quot;text&quot;</span><span class="p">])</span>
 
-<span class="c"># Make predictions on test documents. cvModel uses the best model found (lrModel).</span>
+<span class="c1"># Make predictions on test documents. cvModel uses the best model found (lrModel).</span>
 <span class="n">prediction</span> <span class="o">=</span> <span class="n">cvModel</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test</span><span class="p">)</span>
-<span class="n">selected</span> <span class="o">=</span> <span class="n">prediction</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;id&quot;</span><span class="p">,</span> <span class="s">&quot;text&quot;</span><span class="p">,</span> <span class="s">&quot;probability&quot;</span><span class="p">,</span> <span class="s">&quot;prediction&quot;</span><span class="p">)</span>
+<span class="n">selected</span> <span class="o">=</span> <span class="n">prediction</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&quot;id&quot;</span><span class="p">,</span> <span class="s2">&quot;text&quot;</span><span class="p">,</span> <span class="s2">&quot;probability&quot;</span><span class="p">,</span> <span class="s2">&quot;prediction&quot;</span><span class="p">)</span>
 <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">selected</span><span class="o">.</span><span class="n">collect</span><span class="p">():</span>
     <span class="k">print</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
 </pre></div><div><small>Find full example code at "examples/src/main/python/ml/cross_validator.py" in the Spark repo.</small></div>
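The CrossValidator logic in the example above (expand a 3 x 2 parameter grid, score each setting over k folds, keep the best average metric) can be sketched in plain Python without Spark. This is a minimal illustration of the selection loop only — `toy_evaluate` is a hypothetical stand-in for fitting the Pipeline and scoring with `BinaryClassificationEvaluator`, not the MLlib implementation:

```python
from itertools import product

def param_grid(grid):
    """Expand {param: candidate values} into all combinations,
    mirroring ParamGridBuilder (3 x 2 = 6 settings here)."""
    keys = list(grid)
    return [dict(zip(keys, vals)) for vals in product(*(grid[k] for k in keys))]

def k_folds(data, k):
    """Yield k disjoint (train, held_out) pairs, as CrossValidator does."""
    size = len(data) // k
    for i in range(k):
        held_out = data[i * size:(i + 1) * size]
        train = data[:i * size] + data[(i + 1) * size:]
        yield train, held_out

def cross_validate(data, grid, k, evaluate):
    """Return the parameter setting with the best average metric over k folds."""
    best_params, best_score = None, float("-inf")
    for params in param_grid(grid):
        scores = [evaluate(train, val, params) for train, val in k_folds(data, k)]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score

# Hypothetical evaluator: stands in for model fitting + evaluation.
def toy_evaluate(train, val, params):
    return 1.0 / (1.0 + params["regParam"]) - abs(params["numFeatures"] - 100) / 1000.0

grid = {"numFeatures": [10, 100, 1000], "regParam": [0.1, 0.01]}
best, score = cross_validate(list(range(12)), grid, k=2, evaluate=toy_evaluate)
print(best)  # the winning of the 6 parameter settings
```

As in the real API, the winning setting is the one with the best metric averaged across folds; the comment in the Spark example (`use 3+ folds in practice`) applies equally here, since k=2 gives a noisy estimate.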
@@ -649,7 +649,7 @@ It splits the dataset into these two parts using the <code>trainRatio</code> par
 
     <p>Refer to the <a href="api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit"><code>TrainValidationSplit</code> Scala docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="k">import</span> <span class="nn">org.apache.spark.ml.evaluation.RegressionEvaluator</span>
+    <div class="highlight"><pre><span></span><span class="k">import</span> <span class="nn">org.apache.spark.ml.evaluation.RegressionEvaluator</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.ml.regression.LinearRegression</span>
 <span class="k">import</span> <span class="nn">org.apache.spark.ml.tuning.</span><span class="o">{</span><span class="nc">ParamGridBuilder</span><span class="o">,</span> <span class="nc">TrainValidationSplit</span><span class="o">}</span>
 
@@ -694,7 +694,7 @@ It splits the dataset into these two parts using the <code>trainRatio</code> par
 
     <p>Refer to the <a href="api/java/org/apache/spark/ml/tuning/TrainValidationSplit.html"><code>TrainValidationSplit</code> Java docs</a> for details on the API.</p>
 
-    <div class="highlight"><pre><span class="kn">import</span> <span class="nn">org.apache.spark.ml.evaluation.RegressionEvaluator</span><span class="o">;</span>
+    <div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">org.apache.spark.ml.evaluation.RegressionEvaluator</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.ml.param.ParamMap</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.ml.regression.LinearRegression</span><span class="o">;</span>
 <span class="kn">import</span> <span class="nn">org.apache.spark.ml.tuning.ParamGridBuilder</span><span class="o">;</span>
@@ -711,12 +711,12 @@ It splits the dataset into these two parts using the <code>trainRatio</code> par
 <span class="n">Dataset</span><span class="o">&lt;</span><span class="n">Row</span><span class="o">&gt;</span> <span class="n">training</span> <span class="o">=</span> <span class="n">splits</span><span class="o">[</span><span class="mi">0</span><span class="o">];</span>
 <span class="n">Dataset</span><span class="o">&lt;</span><span class="n">Row</span><span class="o">&gt;</span> <span class="n">test</span> <span class="o">=</span> <span class="n">splits</span><span class="o">[</span><span class="mi">1</span><span class="o">];</span>
 
-<span class="n">LinearRegression</span> <span class="n">lr</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">LinearRegression</span><span class="o">();</span>
+<span class="n">LinearRegression</span> <span class="n">lr</span> <span class="o">=</span> <span class="k">new</span> <span class="n">LinearRegression</span><span class="o">();</span>
 
 <span class="c1">// We use a ParamGridBuilder to construct a grid of parameters to search over.</span>
 <span class="c1">// TrainValidationSplit will try all combinations of values and determine best model using</span>
 <span class="c1">// the evaluator.</span>
-<span class="n">ParamMap</span><span class="o">[]</span> <span class="n">paramGrid</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">ParamGridBuilder</span><span class="o">()</span>
+<span class="n">ParamMap</span><span class="o">[]</span> <span class="n">paramGrid</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ParamGridBuilder</span><span class="o">()</span>
   <span class="o">.</span><span class="na">addGrid</span><span class="o">(</span><span class="n">lr</span><span class="o">.</span><span class="na">regParam</span><span class="o">(),</span> <span class="k">new</span> <span class="kt">double</span><span class="o">[]</span> <span class="o">{</span><span class="mf">0.1</span><span class="o">,</span> <span class="mf">0.01</span><span class="o">})</span>
   <span class="o">.</span><span class="na">addGrid</span><span class="o">(</span><span class="n">lr</span><span class="o">.</span><span class="na">fitIntercept</span><span class="o">())</span>
   <span class="o">.</span><span class="na">addGrid</span><span class="o">(</span><span class="n">lr</span><span class="o">.</span><span class="na">elasticNetParam</span><span class="o">(),</span> <span class="k">new</span> <span class="kt">double</span><span class="o">[]</span> <span class="o">{</span><span class="mf">0.0</span><span class="o">,</span> <span class="mf">0.5</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">})</span>
@@ -724,9 +724,9 @@ It splits the dataset into these two parts using the <code>trainRatio</code> par
 
 <span class="c1">// In this case the estimator is simply the linear regression.</span>
 <span class="c1">// A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.</span>
-<span class="n">TrainValidationSplit</span> <span class="n">trainValidationSplit</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">TrainValidationSplit</span><span class="o">()</span>
+<span class="n">TrainValidationSplit</span> <span class="n">trainValidationSplit</span> <span class="o">=</span> <span class="k">new</span> <span class="n">TrainValidationSplit</span><span class="o">()</span>
   <span class="o">.</span><span class="na">setEstimator</span><span class="o">(</span><span class="n">lr</span><span class="o">)</span>
-  <span class="o">.</span><span class="na">setEvaluator</span><span class="o">(</span><span class="k">new</span> <span class="nf">RegressionEvaluator</span><span class="o">())</span>
+  <span class="o">.</span><span class="na">setEvaluator</span><span class="o">(</span><span class="k">new</span> <span class="n">RegressionEvaluator</span><span class="o">())</span>
   <span class="o">.</span><span class="na">setEstimatorParamMaps</span><span class="o">(</span><span class="n">paramGrid</span><span class="o">)</span>
   <span class="o">.</span><span class="na">setTrainRatio</span><span class="o">(</span><span class="mf">0.8</span><span class="o">);</span>  <span class="c1">// 80% for training and the remaining 20% for validation</span>
 
@@ -746,41 +746,41 @@ It splits the dataset into these two parts using the <code>trainRatio</code> par
 
 Refer to the [`TrainValidationSplit` Python docs](api/python/pyspark.ml.html#pyspark.ml.tuning.TrainValidationSplit) for more details on the API.
 
-<div class="highlight"><pre><span class="kn">from</span> <span class="nn">pyspark.ml.evaluation</span> <span class="kn">import</span> <span class="n">RegressionEvaluator</span>
+<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.ml.evaluation</span> <span class="kn">import</span> <span class="n">RegressionEvaluator</span>
 <span class="kn">from</span> <span class="nn">pyspark.ml.regression</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
 <span class="kn">from</span> <span class="nn">pyspark.ml.tuning</span> <span class="kn">import</span> <span class="n">ParamGridBuilder</span><span class="p">,</span> <span class="n">TrainValidationSplit</span>
 
-<span class="c"># Prepare training and test data.</span>
-<span class="n">data</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">&quot;libsvm&quot;</span><span class="p">)</span>\
-    <span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">&quot;data/mllib/sample_linear_regression_data.txt&quot;</span><span class="p">)</span>
+<span class="c1"># Prepare training and test data.</span>
+<span class="n">data</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s2">&quot;libsvm&quot;</span><span class="p">)</span>\
+    <span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">&quot;data/mllib/sample_linear_regression_data.txt&quot;</span><span class="p">)</span>
 <span class="n">train</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">],</span> <span class="n">seed</span><span class="o">=</span><span class="mi">12345</span><span class="p">)</span>
 
 <span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">maxIter</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
 
-<span class="c"># We use a ParamGridBuilder to construct a grid of parameters to search over.</span>
-<span class="c"># TrainValidationSplit will try all combinations of values and determine best model using</span>
-<span class="c"># the evaluator.</span>
+<span class="c1"># We use a ParamGridBuilder to construct a grid of parameters to search over.</span>
+<span class="c1"># TrainValidationSplit will try all combinations of values and determine best model using</span>
+<span class="c1"># the evaluator.</span>
 <span class="n">paramGrid</span> <span class="o">=</span> <span class="n">ParamGridBuilder</span><span class="p">()</span>\
     <span class="o">.</span><span class="n">addGrid</span><span class="p">(</span><span class="n">lr</span><span class="o">.</span><span class="n">regParam</span><span class="p">,</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">])</span> \
     <span class="o">.</span><span class="n">addGrid</span><span class="p">(</span><span class="n">lr</span><span class="o">.</span><span class="n">fitIntercept</span><span class="p">,</span> <span class="p">[</span><span class="bp">False</span><span class="p">,</span> <span class="bp">True</span><span class="p">])</span>\
     <span class="o">.</span><span class="n">addGrid</span><span class="p">(</span><span class="n">lr</span><span class="o">.</span><span class="n">elasticNetParam</span><span class="p">,</span> <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">])</span>\
     <span class="o">.</span><span class="n">build</span><span class="p">()</span>
 
-<span class="c"># In this case the estimator is simply the linear regression.</span>
-<span class="c"># A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.</span>
+<span class="c1"># In this case the estimator is simply the linear regression.</span>
+<span class="c1"># A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.</span>
 <span class="n">tvs</span> <span class="o">=</span> <span class="n">TrainValidationSplit</span><span class="p">(</span><span class="n">estimator</span><span class="o">=</span><span class="n">lr</span><span class="p">,</span>
                            <span class="n">estimatorParamMaps</span><span class="o">=</span><span class="n">paramGrid</span><span class="p">,</span>
                            <span class="n">evaluator</span><span class="o">=</span><span class="n">RegressionEvaluator</span><span class="p">(),</span>
-                           <span class="c"># 80% of the data will be used for training, 20% for validation.</span>
+                           <span class="c1"># 80% of the data will be used for training, 20% for validation.</span>
                            <span class="n">trainRatio</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
 
-<span class="c"># Run TrainValidationSplit, and choose the best set of parameters.</span>
+<span class="c1"># Run TrainValidationSplit, and choose the best set of parameters.</span>
 <span class="n">model</span> <span class="o">=</span> <span class="n">tvs</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
 
-<span class="c"># Make predictions on test data. model is the model with combination of parameters</span>
-<span class="c"># that performed best.</span>
+<span class="c1"># Make predictions on test data. model is the model with combination of parameters</span>
+<span class="c1"># that performed best.</span>
 <span class="n">model</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test</span><span class="p">)</span>\
-    <span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;features&quot;</span><span class="p">,</span> <span class="s">&quot;label&quot;</span><span class="p">,</span> <span class="s">&quot;prediction&quot;</span><span class="p">)</span>\
+    <span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">&quot;features&quot;</span><span class="p">,</span> <span class="s2">&quot;label&quot;</span><span class="p">,</span> <span class="s2">&quot;prediction&quot;</span><span class="p">)</span>\
     <span class="o">.</span><span class="n">show</span><span class="p">()</span>
 </pre></div><div><small>Find full example code at "examples/src/main/python/ml/train_validation_split.py" in the Spark repo.</small></div>
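Unlike CrossValidator, TrainValidationSplit scores each parameter combination exactly once, against a single held-out slice controlled by `trainRatio`. A plain-Python sketch of that selection rule — the grid mirrors the 2 x 2 x 3 = 12 settings built above, and `toy_evaluate` is a hypothetical stand-in for fitting `LinearRegression` and scoring with `RegressionEvaluator`:

```python
import random

def train_validation_split(data, grid, train_ratio, evaluate, seed=12345):
    """One random split (train_ratio for training, the rest for validation);
    every parameter setting is evaluated once on the held-out part."""
    rnd = random.Random(seed)
    shuffled = data[:]
    rnd.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train, val = shuffled[:cut], shuffled[cut:]
    best_params, best_score = None, float("-inf")
    for params in grid:
        score = evaluate(train, val, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params

# Grid of 2 x 2 x 3 = 12 settings, mirroring the ParamGridBuilder above.
grid = [
    {"regParam": r, "fitIntercept": fi, "elasticNetParam": e}
    for r in (0.1, 0.01)
    for fi in (False, True)
    for e in (0.0, 0.5, 1.0)
]

# Hypothetical evaluator standing in for model fitting + evaluation.
def toy_evaluate(train, val, params):
    bonus = 0.1 if params["fitIntercept"] else 0.0
    return -params["regParam"] - params["elasticNetParam"] + bonus

best = train_validation_split(list(range(100)), grid, train_ratio=0.8,
                              evaluate=toy_evaluate)
print(best)
```

Because each setting is tried only once, this is cheaper than k-fold cross-validation but produces a less reliable estimate unless the validation slice is reasonably large — the trade-off the docs note for TrainValidationSplit.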
 </div>

