spark-reviews mailing list archives

From sethah <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-6129][MLLIB][DOCS] Added user guide for...
Date Mon, 27 Jul 2015 21:11:30 GMT
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7655#discussion_r35587655
  
    --- Diff: docs/mllib-metrics.md ---
    @@ -0,0 +1,1464 @@
    +---
    +layout: global
    +title: Evaluation Metrics - MLlib
    +displayTitle: <a href="mllib-guide.html">MLlib</a> - Evaluation Metrics
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +
    +## Algorithm Metrics
    +
    +Spark's MLlib comes with a number of machine learning algorithms that can be used to learn from and make
    +predictions on data. When applying these algorithms, their performance needs to be evaluated against criteria
    +that depend on the application and its requirements. MLlib also provides a suite of metrics for evaluating the
    +performance of its algorithms.
    +
    +Specific machine learning algorithms fall under broader types of machine learning applications like classification,
    +regression, clustering, etc. Each of these types has well-established metrics for performance evaluation, and the
    +metrics that are currently available in Spark's MLlib are detailed in this section.
    +
    +## Binary Classification
    +
    +[Binary classifiers](https://en.wikipedia.org/wiki/Binary_classification) are used to separate the elements of a
    +given dataset into one of two possible groups (e.g. fraud or not fraud); binary classification is a special case of
    +multiclass classification. Most binary classification metrics can be generalized to multiclass classification
    +metrics.
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>Metric</th><th>Definition</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Precision (Positive Predictive Value)</td>
    +      <td>$PPV=\frac{TP}{TP + FP}$</td>
    +    </tr>
    +    <tr>
    +      <td>Recall (True Positive Rate)</td>
    +      <td>$TPR=\frac{TP}{P}=\frac{TP}{TP + FN}$</td>
    +    </tr>
    +    <tr>
    +      <td>F-measure</td>
    +      <td>$F(\beta) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV \cdot TPR}
    +          {\beta^2 \cdot PPV + TPR}\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Receiver Operating Characteristic (ROC)</td>
    +      <td>$FPR(T)=\int^\infty_{T} P_0(s)\,ds \\ TPR(T)=\int^\infty_{T} P_1(s)\,ds$</td>
    +    </tr>
    +    <tr>
    +      <td>Area Under ROC Curve</td>
    +      <td>$AUROC=\int^1_{0} \frac{TP}{P} d\left(\frac{FP}{N}\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Area Under Precision-Recall Curve</td>
    +      <td>$AUPRC=\int^1_{0} \frac{TP}{TP+FP} d\left(\frac{TP}{P}\right)$</td>
    +    </tr>
    +  </tbody>
    +</table>
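    +
    +As a quick sanity check on these definitions, consider a hypothetical classifier that, at some fixed threshold,
    +produces $TP = 8$, $FP = 2$, and $FN = 4$ on a test set (counts chosen purely for illustration):
    +
    +$$PPV = \frac{8}{8 + 2} = 0.8 \qquad TPR = \frac{8}{8 + 4} \approx 0.667 \qquad
    +F(1) = 2 \cdot \frac{0.8 \cdot 0.667}{0.8 + 0.667} \approx 0.727$$
    +
    +Since $\beta = 1$ weights precision and recall equally, $F(1)$ falls between the two; $\beta < 1$ shifts the
    +F-measure toward precision and $\beta > 1$ toward recall.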
    +
    +
    +**Examples**
    +
    +<div class="codetabs">
    +The following code snippets illustrate how to load a sample dataset, train a binary classification algorithm on the
    +data, and evaluate the performance of the algorithm by several binary evaluation metrics.
    +
    +<div data-lang="scala" markdown="1">
    +
    +{% highlight scala %}
    +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    +import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLUtils
    +
    +// Load training data in LIBSVM format
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
    +
    +// Split data into training (60%) and test (40%)
    +val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    +val training = splits(0).cache()
    +val test = splits(1)
    +
    +// Run training algorithm to build the model
    +val model = new LogisticRegressionWithLBFGS()
    +  .setNumClasses(2)
    +  .run(training)
    +
    +// Clear the prediction threshold so the model will return probabilities
    +model.clearThreshold
    +
    +// Compute raw scores on the test set
    +val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
    +  val prediction = model.predict(features)
    +  (prediction, label)
    +}
    +
    +// Instantiate metrics object
    +val metrics = new BinaryClassificationMetrics(predictionAndLabels)
    +
    +// Precision by threshold
    +val precision = metrics.precisionByThreshold
    +precision.foreach(x => printf("Threshold: %1.2f, Precision: %1.2f\n", x._1, x._2))
    +
    +// Recall by threshold
    +val recall = metrics.recallByThreshold
    +recall.foreach(x => printf("Threshold: %1.2f, Recall: %1.2f\n", x._1, x._2))
    +
    +// Precision-Recall Curve
    +val PRC = metrics.pr
    +
    +// F-measure
    +val f1Score = metrics.fMeasureByThreshold
    +f1Score.foreach(x => printf("Threshold: %1.2f, F-score: %1.2f, Beta = 1\n", x._1, x._2))
    +
    +val beta = 0.5
    +val fScore = metrics.fMeasureByThreshold(beta)
    +fScore.foreach(x => printf("Threshold: %1.2f, F-score: %1.2f, Beta = 0.5\n", x._1, x._2))
    +
    +// AUPRC
    +val auPRC = metrics.areaUnderPR
    +println("Area under precision-recall curve = " + auPRC)
    +
    +// Compute thresholds used in ROC and PR curves
    +val thresholds = precision.map(_._1)
    +
    +// ROC Curve
    +val roc = metrics.roc
    +
    +// AUROC
    +val auROC = metrics.areaUnderROC
    +println("Area under ROC = " + auROC)
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +{% highlight java %}
    +import scala.Tuple2;
    +
    +import org.apache.spark.api.java.*;
    +import org.apache.spark.rdd.RDD;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.mllib.classification.LogisticRegressionModel;
    +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
    +import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.util.MLUtils;
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.SparkContext;
    +
    +public class BinaryClassification {
    +  public static void main(String[] args) {
    +    SparkConf conf = new SparkConf().setAppName("Binary Classification Metrics");
    +    SparkContext sc = new SparkContext(conf);
    +    String path = "data/mllib/sample_binary_classification_data.txt";
    +    JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
    +
    +    // Split initial RDD into two... [60% training data, 40% testing data].
    +    JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[] {0.6, 0.4}, 11L);
    +    JavaRDD<LabeledPoint> training = splits[0].cache();
    +    JavaRDD<LabeledPoint> test = splits[1];
    +
    +    // Run training algorithm to build the model.
    +    final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
    +      .setNumClasses(2)
    +      .run(training.rdd());
    +
    +    // Compute raw scores on the test set.
    +    JavaRDD<Tuple2<Object, Object>> predictionAndLabels = test.map(
    +      new Function<LabeledPoint, Tuple2<Object, Object>>() {
    +        public Tuple2<Object, Object> call(LabeledPoint p) {
    +          Double prediction = model.predict(p.features());
    +          return new Tuple2<Object, Object>(prediction, p.label());
    +        }
    +      }
    +    );
    +
    +    // Get evaluation metrics.
    +    BinaryClassificationMetrics metrics = new BinaryClassificationMetrics(predictionAndLabels.rdd());
    +
    +    // Precision by threshold
    +    JavaRDD<Tuple2<Object, Object>> precision = metrics.precisionByThreshold().toJavaRDD();
    +    System.out.println("Precision by threshold: " + precision.toArray());
    +
    +    // Recall by threshold
    +    JavaRDD<Tuple2<Object, Object>> recall = metrics.recallByThreshold().toJavaRDD();
    +    System.out.println("Recall by threshold: " + recall.toArray());
    +
    +    // F Score by threshold
    +    JavaRDD<Tuple2<Object, Object>> f1Score = metrics.fMeasureByThreshold().toJavaRDD();
    +    System.out.println("F1 Score by threshold: " + f1Score.toArray());
    +
    +    JavaRDD<Tuple2<Object, Object>> f2Score = metrics.fMeasureByThreshold(2.0).toJavaRDD();
    +    System.out.println("F2 Score by threshold: " + f2Score.toArray());
    +
    +    // Precision-recall curve
    +    JavaRDD<Tuple2<Object, Object>> prc = metrics.pr().toJavaRDD();
    +    System.out.println("Precision-recall curve: " + prc.toArray());
    +
    +    // Thresholds
    +    JavaRDD<Double> thresholds = precision.map(
    +      new Function<Tuple2<Object, Object>, Double>() {
    +        public Double call(Tuple2<Object, Object> t) {
    +          return new Double(t._1().toString());
    +        }
    +      }
    +    );
    +
    +    // ROC Curve
    +    JavaRDD<Tuple2<Object, Object>> roc = metrics.roc().toJavaRDD();
    +    System.out.println("ROC curve: " + roc.toArray());
    +
    +    // AUPRC
    +    System.out.println("Area under precision-recall curve = " + metrics.areaUnderPR());
    +
    +    // AUROC
    +    System.out.println("Area under ROC = " + metrics.areaUnderROC());
    +
    +    // Save and load model
    +    model.save(sc, "myModelPath");
    +    LogisticRegressionModel sameModel = LogisticRegressionModel.load(sc, "myModelPath");
    +  }
    +}
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +{% highlight python %}
    +from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    +from pyspark.mllib.evaluation import BinaryClassificationMetrics
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.util import MLUtils
    +
    +# Several of the methods available in Scala are currently missing from PySpark
    +
    +# Load training data in LIBSVM format
    +data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
    +
    +# Split data into training (60%) and test (40%)
    +splits = data.randomSplit([0.6, 0.4], seed = 11L)
    +training = splits[0].cache()
    +test = splits[1]
    +
    +# Run training algorithm to build the model
    +model = LogisticRegressionWithLBFGS.train(training)
    +
    +# Compute raw scores on the test set
    +predictionAndLabels = test.map(lambda lp: (float(model.predict(lp.features)), lp.label))
    +
    +# Instantiate metrics object
    +metrics = BinaryClassificationMetrics(predictionAndLabels)
    +
    +# Area under precision-recall curve
    +print "Area under PR = %1.2f" % metrics.areaUnderPR
    +
    +# Area under ROC curve
    +print "Area under ROC = %1.2f" % metrics.areaUnderROC
    +
    +{% endhighlight %}
    +
    +</div>
    +</div>
    +
    +
    +## Multiclass Classification
    +
    +A [multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification) problem is one in which
    +there are $M \gt 2$ possible labels for each data point (the case where $M=2$ is the binary classification
    +problem). For example, classifying handwriting samples as the digits 0 to 9 is a multiclass problem with 10
    +possible classes.
    +
    +Define the class, or label, set as
    +
    +$$L = \{\ell_0, \ell_1, \ldots, \ell_{M-1} \} $$
    +
    +The true output vector $\mathbf{y}$ consists of $N$ elements
    +
    +$$\mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_{N-1} \in L $$
    +
    +A multiclass prediction algorithm generates a prediction vector $\hat{\mathbf{y}}$ of $N$ elements
    +
    +$$\hat{\mathbf{y}}_0, \hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_{N-1} \in L $$
    +
    +For this section, a modified delta function $\hat{\delta}(x)$ will prove useful
    +
    +$$\hat{\delta}(x) = \begin{cases}1 & \text{if $x = 0$}, \\ 0 & \text{otherwise}.\end{cases}$$
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>Metric</th><th>Definition</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Confusion Matrix</td>
    +      <td>
    +        $C_{ij} = \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k - \ell_i) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_j)\\ \\
    +         \left( \begin{array}{ccc}
    +         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k - \ell_0) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_0) & \ldots &
    +         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k - \ell_0) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_{M-1}) \\
    +         \vdots & \ddots & \vdots \\
    +         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k - \ell_{M-1}) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_0) & \ldots &
    +         \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k - \ell_{M-1}) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_{M-1})
    +         \end{array} \right)$
    +      </td>
    +    </tr>
    +    <tr>
    +      <td>Overall Precision</td>
    +      <td>$PPV = \frac{TP}{TP + FP} = \frac{1}{N}\sum_{i=0}^{N-1} \hat{\delta}\left(\hat{\mathbf{y}}_i - \mathbf{y}_i\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Overall Recall</td>
    +      <td>$TPR = \frac{TP}{TP + FN} = \frac{1}{N}\sum_{i=0}^{N-1} \hat{\delta}\left(\hat{\mathbf{y}}_i - \mathbf{y}_i\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Overall F1-measure</td>
    +      <td>$F1 = 2 \cdot \left(\frac{PPV \cdot TPR}
    +          {PPV + TPR}\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Precision by label</td>
    +      <td>$PPV(\ell) = \frac{TP}{TP + FP} =
    +          \frac{\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell) \cdot \hat{\delta}(\mathbf{y}_i - \ell)}
    +          {\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell)}$</td>
    +    </tr>
    +    <tr>
    +      <td>Recall by label</td>
    +      <td>$TPR(\ell)=\frac{TP}{P} =
    +          \frac{\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell) \cdot \hat{\delta}(\mathbf{y}_i - \ell)}
    +          {\sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i - \ell)}$</td>
    +    </tr>
    +    <tr>
    +      <td>F-measure by label</td>
    +      <td>$F(\beta, \ell) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV(\ell) \cdot TPR(\ell)}
    +          {\beta^2 \cdot PPV(\ell) + TPR(\ell)}\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Weighted precision</td>
    +      <td>$PPV_{w}= \frac{1}{N} \sum\nolimits_{\ell \in L} PPV(\ell)
    +          \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$</td>
    +    </tr>
    +    <tr>
    +      <td>Weighted recall</td>
    +      <td>$TPR_{w}= \frac{1}{N} \sum\nolimits_{\ell \in L} TPR(\ell)
    +          \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$</td>
    +    </tr>
    +    <tr>
    +      <td>Weighted F-measure</td>
    +      <td>$F_{w}(\beta)= \frac{1}{N} \sum\nolimits_{\ell \in L} F(\beta, \ell)
    +          \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$</td>
    +    </tr>
    +  </tbody>
    +</table>
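    +
    +To make these definitions concrete, consider a hypothetical three-class problem with $L = \{0, 1, 2\}$ and $N = 5$
    +points, true labels $\mathbf{y} = (0, 1, 2, 2, 1)$, and predictions $\hat{\mathbf{y}} = (0, 2, 2, 2, 1)$ (values
    +chosen purely for illustration). Four of the five predictions match their labels, so overall precision and recall
    +are both $4/5 = 0.8$. For label $\ell = 2$, three points are predicted as $2$ but only two of them truly belong to
    +class $2$, while both true $2$s are recovered:
    +
    +$$PPV(2) = \frac{2}{3} \qquad TPR(2) = \frac{2}{2} = 1 \qquad
    +F(1, 2) = 2 \cdot \frac{(2/3) \cdot 1}{(2/3) + 1} = \frac{4}{5}$$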
    +
    +**Examples**
    +
    +<div class="codetabs">
    +The following code snippets illustrate how to load a sample dataset, train a multiclass classification algorithm on
    +the data, and evaluate the performance of the algorithm by several multiclass classification evaluation metrics.
    +
    +<div data-lang="scala" markdown="1">
    +
    +{% highlight scala %}
    +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    +import org.apache.spark.mllib.evaluation.MulticlassMetrics
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLUtils
    +
    +// Load training data in LIBSVM format
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt")
    +
    +// Split data into training (60%) and test (40%)
    +val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    +val training = splits(0).cache()
    +val test = splits(1)
    +
    +// Run training algorithm to build the model
    +val model = new LogisticRegressionWithLBFGS()
    +  .setNumClasses(3)
    +  .run(training)
    +
    +// Compute raw scores on the test set
    +val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
    +  val prediction = model.predict(features)
    +  (prediction, label)
    +}
    +
    +// Instantiate metrics object
    +val metrics = new MulticlassMetrics(predictionAndLabels)
    +
    +// Confusion matrix
    +println("Confusion matrix:")
    +println(metrics.confusionMatrix)
    +
    +// Overall Statistics
    +val precision = metrics.precision
    +val recall = metrics.recall // same as true positive rate
    +val f1Score = metrics.fMeasure
    +println("Summary Statistics")
    +printf("Precision = %1.2f\n", precision)
    +printf("Recall = %1.2f\n", recall)
    +printf("F1 Score = %1.2f\n", f1Score)
    +
    +// Precision by label
    +val labels = metrics.labels
    +labels.foreach(l => printf("Precision(%s): %1.2f\n", l, metrics.precision(l)))
    +
    +// Recall by label
    +labels.foreach(l => printf("Recall(%s): %1.2f\n", l, metrics.recall(l)))
    +
    +// False positive rate by label
    +labels.foreach(l => printf("FPR(%s): %1.2f\n", l, metrics.falsePositiveRate(l)))
    +
    +// F-measure by label
    +labels.foreach(l => printf("F1 Score(%s): %1.2f\n", l, metrics.fMeasure(l)))
    +
    +// Weighted stats
    +printf("Weighted precision: %1.2f\n", metrics.weightedPrecision)
    +printf("Weighted recall: %1.2f\n", metrics.weightedRecall)
    +printf("Weighted F1 score: %1.2f\n", metrics.weightedFMeasure)
    +printf("Weighted false positive rate: %1.2f\n", metrics.weightedFalsePositiveRate)
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +{% highlight java %}
    +import scala.Tuple2;
    +
    +import org.apache.spark.api.java.*;
    +import org.apache.spark.rdd.RDD;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.mllib.classification.LogisticRegressionModel;
    +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
    +import org.apache.spark.mllib.evaluation.MulticlassMetrics;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.util.MLUtils;
    +import org.apache.spark.mllib.linalg.Matrix;
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.SparkContext;
    +
    +public class MulticlassClassification {
    +  public static void main(String[] args) {
    +    SparkConf conf = new SparkConf().setAppName("Multiclass Classification Metrics");
    +    SparkContext sc = new SparkContext(conf);
    +    String path = "data/mllib/sample_multiclass_classification_data.txt";
    +    JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
    +
    +    // Split initial RDD into two... [60% training data, 40% testing data].
    +    JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[] {0.6, 0.4}, 11L);
    +    JavaRDD<LabeledPoint> training = splits[0].cache();
    +    JavaRDD<LabeledPoint> test = splits[1];
    +
    +    // Run training algorithm to build the model.
    +    final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
    +      .setNumClasses(3)
    +      .run(training.rdd());
    +
    +    // Compute raw scores on the test set.
    +    JavaRDD<Tuple2<Object, Object>> predictionAndLabels = test.map(
    +      new Function<LabeledPoint, Tuple2<Object, Object>>() {
    +        public Tuple2<Object, Object> call(LabeledPoint p) {
    +          Double prediction = model.predict(p.features());
    +          return new Tuple2<Object, Object>(prediction, p.label());
    +        }
    +      }
    +    );
    +
    +    // Get evaluation metrics.
    +    MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
    +
    +    // Confusion matrix
    +    Matrix confusion = metrics.confusionMatrix();
    +    System.out.println("Confusion matrix: \n" + confusion);
    +
    +    // Overall statistics
    +    System.out.println("Precision = " + metrics.precision());
    +    System.out.println("Recall = " + metrics.recall());
    +    System.out.println("F1 Score = " + metrics.fMeasure());
    +
    +    // Stats by labels
    +    for (int i = 0; i < metrics.labels().length; i++) {
    +        System.out.format("Class %1.2f precision = %1.2f\n", metrics.labels()[i], metrics.precision(metrics.labels()[i]));
    +        System.out.format("Class %1.2f recall = %1.2f\n", metrics.labels()[i], metrics.recall(metrics.labels()[i]));
    +        System.out.format("Class %1.2f F1 score = %1.2f\n", metrics.labels()[i], metrics.fMeasure(metrics.labels()[i]));
    +    }
    +
    +    // Weighted stats
    +    System.out.format("Weighted precision = %1.2f\n", metrics.weightedPrecision());
    +    System.out.format("Weighted recall = %1.2f\n", metrics.weightedRecall());
    +    System.out.format("Weighted F1 score = %1.2f\n", metrics.weightedFMeasure());
    +    System.out.format("Weighted false positive rate = %1.2f\n", metrics.weightedFalsePositiveRate());
    +
    +    // Save and load model
    +    model.save(sc, "myModelPath");
    +    LogisticRegressionModel sameModel = LogisticRegressionModel.load(sc, "myModelPath");
    +  }
    +}
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +
    +{% highlight python %}
    +from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    +from pyspark.mllib.util import MLUtils
    +from pyspark.mllib.evaluation import MulticlassMetrics
    +
    +# Load training data in LIBSVM format
    +data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt")
    +
    +# Split data into training (60%) and test (40%)
    +splits = data.randomSplit([0.6, 0.4], seed = 11L)
    +training = splits[0].cache()
    +test = splits[1]
    +
    +# Run training algorithm to build the model
    +model = LogisticRegressionWithLBFGS.train(training, numClasses=3)
    +
    +# Compute raw scores on the test set
    +predictionAndLabels = test.map(lambda lp: (float(model.predict(lp.features)), lp.label))
    +
    +# Instantiate metrics object
    +metrics = MulticlassMetrics(predictionAndLabels)
    +
    +# Overall statistics
    +precision = metrics.precision()
    +recall = metrics.recall()
    +f1Score = metrics.fMeasure()
    +print "Summary Stats"
    +print "Precision = %1.2f" % precision
    +print "Recall = %1.2f" % recall
    +print "F1 Score = %1.2f" % f1Score
    +
    +# Statistics by class
    +labels = data.map(lambda lp: lp.label).distinct().collect()
    +for label in sorted(labels):
    +    print "Class %s precision = %1.2f" % (label, metrics.precision(label))
    +    print "Class %s recall = %1.2f" % (label, metrics.recall(label))
    +    print "Class %s F1 Measure = %1.2f" % (label, metrics.fMeasure(label, beta=1.0))
    +
    +# Weighted stats
    +print "Weighted recall = %1.2f" % metrics.weightedRecall
    +print "Weighted precision = %1.2f" % metrics.weightedPrecision
    +print "Weighted F(1) Score = %1.2f" % metrics.weightedFMeasure()
    +print "Weighted F(0.5) Score = %1.2f" % metrics.weightedFMeasure(beta=0.5)
    +print "Weighted false positive rate = %1.2f" % metrics.weightedFalsePositiveRate
    +{% endhighlight %}
    +
    +</div>
    +</div>
    +
    +## Multilabel Classification
    +
    +A [multilabel classification](https://en.wikipedia.org/wiki/Multi-label_classification) problem involves mapping
    +each sample in a dataset to a set of class labels. In this type of classification problem, the labels are not
    +mutually exclusive. For example, when classifying a set of news articles into topics, a single article might be
    +about both science and politics.
    +
    +Here we define a set $D$ of $N$ documents
    +
    +$$D = \left\{d_0, d_1, ..., d_{N-1}\right\}$$
    +
    +Define $L_0, L_1, ..., L_{N-1}$ to be a family of label sets and $P_0, P_1, ..., P_{N-1}$
    +to be a family of prediction sets where $L_i$ and $P_i$ are the label set and prediction set, respectively, that
    +correspond to document $d_i$.
    +
    +The set of all unique labels is given by
    +
    +$$L = \bigcup_{k=0}^{N-1} L_k$$
    +
    +The following definition of the indicator function $I_A(x)$ on a set $A$ will be necessary
    +
    +$$I_A(x) = \begin{cases}1 & \text{if $x \in A$}, \\ 0 & \text{otherwise}.\end{cases}$$
    +
    +<table class="table">
    +  <thead>
    +    <tr><th>Metric</th><th>Definition</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Precision</td><td>$\frac{1}{N} \sum_{i=0}^{N-1} \frac{\left|P_i \cap L_i\right|}{\left|P_i\right|}$</td>
    +    </tr>
    +    <tr>
    +      <td>Recall</td><td>$\frac{1}{N} \sum_{i=0}^{N-1} \frac{\left|L_i \cap P_i\right|}{\left|L_i\right|}$</td>
    +    </tr>
    +    <tr>
    +      <td>Accuracy</td>
    +      <td>
    +        $\frac{1}{N} \sum_{i=0}^{N - 1} \frac{\left|L_i \cap P_i \right|}
    +        {\left|L_i\right| + \left|P_i\right| - \left|L_i \cap P_i \right|}$
    +      </td>
    +    </tr>
    +    <tr>
    +      <td>Precision by label</td><td>$PPV(\ell)=\frac{TP}{TP + FP}=
    +          \frac{\sum_{i=0}^{N-1} I_{P_i}(\ell) \cdot I_{L_i}(\ell)}
    +          {\sum_{i=0}^{N-1} I_{P_i}(\ell)}$</td>
    +    </tr>
    +    <tr>
    +      <td>Recall by label</td><td>$TPR(\ell)=\frac{TP}{P}=
    +          \frac{\sum_{i=0}^{N-1} I_{P_i}(\ell) \cdot I_{L_i}(\ell)}
    +          {\sum_{i=0}^{N-1} I_{L_i}(\ell)}$</td>
    +    </tr>
    +    <tr>
    +      <td>F1-measure by label</td><td>$F1(\ell) = 2
    +                            \cdot \left(\frac{PPV(\ell) \cdot TPR(\ell)}
    +                            {PPV(\ell) + TPR(\ell)}\right)$</td>
    +    </tr>
    +    <tr>
    +      <td>Hamming Loss</td>
    +      <td>
    +        $\frac{1}{N \cdot \left|L\right|} \sum_{i=0}^{N - 1} \left( \left|L_i\right| + \left|P_i\right| - 2\left|L_i \cap P_i\right| \right)$
    +      </td>
    +    </tr>
    +    <tr>
    +      <td>Subset Accuracy</td>
    +      <td>$\frac{1}{N} \sum_{i=0}^{N-1} I_{\{L_i\}}(P_i)$</td>
    +    </tr>
    +    <tr>
    +      <td>F1 Measure</td>
    +      <td>$\frac{1}{N} \sum_{i=0}^{N-1} 2 \frac{\left|P_i \cap L_i\right|}{\left|P_i\right| + \left|L_i\right|}$</td>
    +    </tr>
    +    <tr>
    +      <td>Micro precision</td>
    +      <td>$\frac{TP}{TP + FP}=\frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}
    +          {\sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|P_i - L_i\right|}$</td>
    +    </tr>
    +    <tr>
    +      <td>Micro recall</td>
    +      <td>$\frac{TP}{TP + FN}=\frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}
    +        {\sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|L_i - P_i\right|}$</td>
    +    </tr>
    +    <tr>
    +      <td>Micro F1 Measure</td>
    +      <td>
    +        $2 \cdot \frac{TP}{2 \cdot TP + FP + FN}=2 \cdot \frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}
    +        {2 \cdot \sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|L_i - P_i\right| + \sum_{i=0}^{N-1} \left|P_i - L_i\right|}$
    +      </td>
    +    </tr>
    +  </tbody>
    +</table>
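    +
    +As a small worked example, take a hypothetical dataset of $N = 2$ documents with label sets $L_0 = \{0, 1\}$ and
    +$L_1 = \{1, 2\}$, and prediction sets $P_0 = \{0\}$ and $P_1 = \{1, 2\}$ (sets chosen purely for illustration), so
    +that $L = \{0, 1, 2\}$. Every predicted label is correct, but label $1$ is missed for $d_0$:
    +
    +$$\text{precision} = \frac{1}{2}\left(\frac{1}{1} + \frac{2}{2}\right) = 1 \qquad
    +\text{recall} = \frac{1}{2}\left(\frac{1}{2} + \frac{2}{2}\right) = 0.75 \qquad
    +\text{Hamming loss} = \frac{(2 + 1 - 2) + (2 + 2 - 4)}{2 \cdot 3} = \frac{1}{6}$$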
    +
    +**Examples**
    +
    +<div class="codetabs">
    +The following code snippets illustrate how to evaluate the performance of a multilabel classifier.
    +
    +<div data-lang="scala" markdown="1">
    +
    +{% highlight scala %}
    +import org.apache.spark.mllib.evaluation.MultilabelMetrics
    +import org.apache.spark.rdd.RDD
    +
    +/**
    --- End diff --
    
    Moved duplicated comments to the markdown text.

