spark-reviews mailing list archives

From vlad17 <...@git.apache.org>
Subject [GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost [WIP]
Date Mon, 08 Aug 2016 22:15:01 GMT
GitHub user vlad17 opened a pull request:

    https://github.com/apache/spark/pull/14547

    [SPARK-16718][MLlib] gbm-style treeboost [WIP]

    ## What changes were proposed in this pull request?
    
    This change adds TreeBoost functionality to `GBTClassifier` and `GBTRegressor`. The main
change is that leaf nodes now make a prediction that optimizes the loss function, rather
than always using the mean label (which is optimal only for variance-based impurity).
    
    This also changes the default impurity to the loss-based impurity; previously, variance was required.
    
    I made this change only for L2 loss and logistic loss (adding some aliases to the names
as well for parity with R's implementation, GBM). These two functions have leaf predictions
that can be computed within the framework of the current impurity API. Other loss functions
will require API modification, which should be its own change, SPARK-16728.
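
    As a rough illustration of the kind of leaf prediction described above, here is a minimal Python
sketch, not the code in this patch. The `LeafStats` class and helper names are hypothetical and are
not Spark's impurity classes. It assumes the L2 leaf value is the mean residual and that the logistic
leaf value is the one-step Newton approximation used by gbm's Bernoulli loss (labels in {0, 1}, scores
as log-odds). Both reduce to a ratio of sums that can be accumulated and merged additively. An L1
leaf value would be a median of residuals, which is not a sum; that is an assumption about why it
does not fit the same scheme, not something stated in this PR.

```python
import math
from dataclasses import dataclass


@dataclass
class LeafStats:
    """Additive per-node statistics (hypothetical, not Spark's impurity API)."""
    numerator: float = 0.0    # sum of pseudo-residuals (negative gradients)
    denominator: float = 0.0  # sum of second derivatives of the loss

    def merge(self, other: "LeafStats") -> "LeafStats":
        # Statistics from different partitions combine by plain addition.
        return LeafStats(self.numerator + other.numerator,
                         self.denominator + other.denominator)

    def predict(self) -> float:
        # Leaf value: a single Newton step from the current scores.
        if self.denominator <= 0.0:
            return 0.0
        return self.numerator / self.denominator


def l2_stats(label: float, score: float) -> LeafStats:
    # Loss (label - score)^2 / 2: residual is label - score, second derivative is 1,
    # so the leaf value is the mean residual.
    return LeafStats(label - score, 1.0)


def bernoulli_stats(label: float, score: float) -> LeafStats:
    # Logistic loss with label in {0, 1} and score as log-odds.
    p = 1.0 / (1.0 + math.exp(-score))
    return LeafStats(label - p, p * (1.0 - p))


def leaf_prediction(stat_fn, labels, scores) -> float:
    total = LeafStats()
    for y, f in zip(labels, scores):
        total = total.merge(stat_fn(y, f))
    return total.predict()
```

    For example, `leaf_prediction(l2_stats, [1.0, 2.0, 6.0], [0.0, 0.0, 0.0])` gives the mean
residual, 3.0.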
    
    Note that because loss-based impurity with L1 loss is NOT supported, code that sets L1 loss
but leaves impurity at its default will now throw (impurity must be set to variance explicitly).
    
    ## How was this patch tested?
    
    Unit testing for correctness: I tested default parameter values and new settings for
the parameters.
    
    [WIP] For accuracy, I'm currently comparing the performance on a [real-life dataset](https://www.datarobot.com/blog/r-getting-started-with-data-science/)
between Spark and GBM. I will upload the results once I have them.
    [WIP] This code shouldn't introduce any regressions, but it would be nice to make sure.
I'm waiting for @sethah to respond on [his previous PR](https://github.com/apache/spark/commit/dafd70fbfe70702502ef198f2a8f529ef7557592)
so that he can make his benchmarking script available to me.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vlad17/spark GBT-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14547.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14547
    
----
commit 6c7c60b581464be13b44aa43d2c402501fdb0505
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-22T01:01:58Z

    Added new documentation for TreeBoost, top-level calls

commit a4c050675bc524b742cb9fc3703ce5105cabdd8a
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-22T19:55:10Z

    Implemented ApproxBernoulliImpurity

commit 5a38e0c1b284423f3129c4edbacece562fb675a3
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-25T22:59:19Z

    Added approximate Bernoulli impurity (L_2 treeboost)

commit 759d1aa1a20c1679fba212c3017e200d386fa6da
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-26T00:21:22Z

    Added marker saying Laplace Impurity is not yet supported (requires internal API change)

commit e027d6dedd928e96dc7c99dc699d9f7c374034a3
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-26T00:26:29Z

    Updated docs to reflect lack of L1 impurity support

commit 15575a13c0ad4f2567bcccdcbcb134a9ca548d9c
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-26T00:41:00Z

    Fixed urls

commit 7c7d804dc3c614984e863aae9ef8ffc8f9ec3117
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-26T00:43:46Z

    Removed ApproxLaplaceImpurity

commit 44a58efe4b0b1bd69eaadc5dc17676194b949888
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-26T00:50:50Z

    Fix reader docs

commit b362c3852c0e17783b08a9c9a97e1abb66ef5c9f
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-26T23:43:41Z

    Fixed a bunch of bugs + tested wrt old behavior

commit f31903c228c164313c2f0cb22fac8b81effff6a1
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-27T00:47:51Z

    Completed tests for reading/writing new impurities

commit 01eae2ae967fdbe89b0ecd440216e54431d51d3d
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-07-27T17:15:05Z

    Finished tests

commit bd189e2aae27266314b16f0dffc3ce7a230d4e27
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-08-06T23:16:18Z

    Added R's gbm as a direct comparison to GBTClassifier

commit 704864354619581f1f5bb43489c5e2ee9ec89487
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-08-07T00:20:35Z

    Got rid of direct R comparison

commit a0a8fcddefa122682c579b567524cbcf2b00251c
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-08-08T06:18:14Z

    Direct behavior-checking test (for GBTClassifier)

commit c050586e7db6eed41f5b8ddf1e245b13be2c8994
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-08-08T20:44:42Z

    Added analogous test for GBTReressor

commit 7e39ada3acf431c171adfca0603279002ff20153
Author: Vladimir Feinberg <vf@databricks.com>
Date:   2016-08-08T21:03:47Z

    Cleaned up style-related things

----

