spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-21806) BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
Date Fri, 25 Aug 2017 20:52:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21806:
------------------------------------

    Assignee: Apache Spark

> BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
> ----------------------------------------------------------------------
>
>                 Key: SPARK-21806
>                 URL: https://issues.apache.org/jira/browse/SPARK-21806
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.2.0
>            Reporter: Marc Kaminski
>            Assignee: Apache Spark
>            Priority: Minor
>         Attachments: PRROC_example.jpeg
>
>
> I would like to refer to a [discussion in scikit-learn|https://github.com/scikit-learn/scikit-learn/issues/4223],
as this behavior is probably based on the scikit-learn implementation. 
> Summary: 
> Currently, the first point of the precision-recall curve (its y-axis intercept) is always set to (0.0, 1.0), i.e. recall 0.0 and precision 1.0. This
behavior is not ideal in certain edge cases (see example below) and can also affect
cross-validation when the optimization metric is set to "areaUnderPR". 
> Please consider [blucena's post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613]
for possible alternatives. 
> Edge case example: 
> Consider a bad classifier that assigns a high probability to all samples. A possible
output might look like this: 
> ||Real label || Score ||
> |1.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 0.95 |
> |0.0 | 0.95 |
> |1.0 | 1.0 |
> This results in the following PR points (the first row is set by default): 
> ||Threshold || Recall || Precision ||
> |1.0 | 0.0 | 1.0 |
> |0.95 | 1.0 | 0.2 |
> |0.0 | 1.0 | 0.16 |
> The auPRC would be around 0.6, even though the precision computed from the data never exceeds 0.2. Classifiers with a more differentiated probability assignment
 will therefore be wrongly judged to perform worse with respect to this auPRC.
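
To make the arithmetic in the example easier to check, here is a minimal Python sketch that recomputes the (recall, precision) points and the trapezoidal area from the label/score table. It is not Spark's BinaryClassificationMetrics code. The helper names `pr_points` and `trapezoid_area` and the `first_point` parameter are hypothetical and exist only for this illustration. The second call uses (0.0, precision at the highest threshold) as the first point. That choice is an assumption made for comparison and is not necessarily one of the alternatives proposed in the linked discussion.

```python
# Labels and scores copied row by row from the table in the report.
labels = [1.0] + [0.0] * 10 + [1.0]
scores = [1.0] * 9 + [0.95] * 2 + [1.0]


def pr_points(labels, scores, first_point=(0.0, 1.0)):
    """Return (recall, precision) pairs: an optional fixed first point,
    then one point per distinct score used as a threshold (descending)."""
    total_pos = sum(1 for y in labels if y == 1.0)
    points = [first_point] if first_point is not None else []
    for t in sorted(set(scores), reverse=True):
        # Samples predicted positive at threshold t.
        predicted = [y for y, s in zip(labels, scores) if s >= t]
        tp = sum(1 for y in predicted if y == 1.0)
        points.append((tp / total_pos, tp / len(predicted)))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) pairs."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Curve with the default first point (0.0, 1.0), as described in the report.
default_curve = pr_points(labels, scores)
print(default_curve, trapezoid_area(default_curve))

# Hypothetical alternative: start at (0.0, precision of the first real point).
raw = pr_points(labels, scores, first_point=None)
alt_curve = pr_points(labels, scores, first_point=(0.0, raw[0][1]))
print(alt_curve, trapezoid_area(alt_curve))
```

With the default first point, the area is dominated by the single segment from (0.0, 1.0) to (1.0, 0.2), which is where the value of about 0.6 comes from. Under the assumed alternative the same data gives an area of about 0.2.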




