spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <>
Subject [jira] [Commented] (SPARK-14681) Provide label/impurity stats for decision tree nodes
Date Thu, 08 Mar 2018 19:16:00 GMT


Joseph K. Bradley commented on SPARK-14681:

[~WeichenXu123] Thanks for the PR!  I'll comment on the design here in the JIRA.

From your PR:
class TreeClassifierStatInfo
   def getLabelCount(label: Int): Double

class TreeRegressorStatInfo
   def getCount(): Double
   def getSum(): Double
   def getSquareSum(): Double

class Node
   +++ def statInfo: TreeStatInfo

trait TreeStatInfo
   def asTreeClassifierStatInfo: TreeClassifierStatInfo
   def asTreeRegressorStatInfo: TreeRegressorStatInfo

I have a few thoughts:
* I like the overall approach of using classes instead of just returning plain double arrays.
* This will require users to explicitly cast TreeStatInfo to the classifier/regressor type.
 Would it be possible to avoid that without breaking APIs, e.g., by having a ClassificationNode
and a RegressionNode inheriting from Node?
* Naming: What about using "Stats" or "Statistics" instead of "StatInfo"?  I just feel the
"Info" part is uninformative.

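To make the second bullet concrete, here is a minimal, self-contained sketch of the subclassing idea, assuming typed node traits that each expose their own stats class. Every name in it (ClassificationStats, RegressionStats, ClassificationLeaf, RegressionLeaf, labelCountAt) is hypothetical and chosen for illustration; none of it is Spark's actual API, and the real Node hierarchy may impose constraints this sketch ignores.

```scala
// Hypothetical sketch only: these types are NOT Spark's API.

// Typed statistics instead of plain Array[Double] results.
final class ClassificationStats(labelCounts: Array[Double]) {
  def getLabelCount(label: Int): Double = labelCounts(label)
}

final class RegressionStats(count: Double, sum: Double, squareSum: Double) {
  def getCount: Double = count
  def getSum: Double = sum
  def getSquareSum: Double = squareSum
}

// Common base type, kept free of any stats accessor.
sealed trait Node {
  def prediction: Double
}

// Each subtype exposes the stats class that matches its task,
// so callers never cast a generic stats object.
sealed trait ClassificationNode extends Node {
  def stats: ClassificationStats
}

sealed trait RegressionNode extends Node {
  def stats: RegressionStats
}

final class ClassificationLeaf(val prediction: Double, val stats: ClassificationStats)
  extends ClassificationNode

final class RegressionLeaf(val prediction: Double, val stats: RegressionStats)
  extends RegressionNode

object NodeStatsSketch {
  // A caller holding a ClassificationNode reads label counts directly.
  def labelCountAt(node: ClassificationNode, label: Int): Double =
    node.stats.getLabelCount(label)

  def main(args: Array[String]): Unit = {
    val leaf = new ClassificationLeaf(1.0, new ClassificationStats(Array(3.0, 7.0)))
    println(labelCountAt(leaf, 1)) // prints 7.0
  }
}
```

With this shape, a classification model would return ClassificationNode from its root accessor and a regression model would return RegressionNode, so the as...StatInfo conversions become unnecessary. Whether existing public signatures can be narrowed this way without breaking APIs is exactly the open question.
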
> Provide label/impurity stats for decision tree nodes
> -------------------------------------------------------------
>                 Key: SPARK-14681
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Major
> Currently, decision trees provide all node info except for the aggregated stats
> about labels and impurities.  This task is to provide those publicly.  We need to choose a
> good API for it, so we should discuss the design on this issue before implementing it.
