From Yahoo_SK <>
Subject Difference in AUCs b/w Spark's GBT and sklearn's
Date Mon, 21 Dec 2015 16:17:54 GMT

I tried GBDTs on a binary classification problem with both Python's sklearn and Spark's MLlib
implementation (local standalone mode, default settings). I kept the number of iterations
(numIterations) and the loss function the same in both cases. The features are all real-valued
and continuous. However, the AUC from the MLlib implementation was way off compared to
sklearn's. These were the parameters for sklearn's classifier:

    init=None, learning_rate=0.001, loss='deviance', max_depth=8,
    max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
    min_samples_split=2, min_weight_fraction_leaf=0.0,
    n_estimators=100, random_state=None, subsample=1.0,
    verbose=0, warm_start=False

I wanted to check whether there is a way to find and set the equivalent parameters in MLlib,
or whether MLlib already assumes the same settings (which are pretty standard).
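
For reference, here is a rough sketch of how the sklearn settings above might translate to
MLlib. It assumes the keyword names of PySpark's GradientBoostedTrees.trainClassifier
(numIterations, learningRate, maxDepth, loss) and treats "logLoss" as the closest counterpart
of sklearn's 'deviance'. The helper to_mllib_kwargs and both lookup tables are hypothetical
names made up for illustration, not part of either library.

```python
# Hypothetical helper: translate the sklearn settings quoted above into keyword
# arguments for pyspark.mllib.tree.GradientBoostedTrees.trainClassifier.
# The name mapping is an assumption based on the PySpark keyword names.
SKLEARN_TO_MLLIB = {
    "n_estimators": "numIterations",
    "learning_rate": "learningRate",
    "max_depth": "maxDepth",
    "loss": "loss",
}

# Assumption: MLlib's "logLoss" is the closest match to sklearn's 'deviance'.
LOSS_NAMES = {"deviance": "logLoss"}


def to_mllib_kwargs(sklearn_params):
    """Return (mapped MLlib kwargs, sorted sklearn names with no mapping here)."""
    mapped, unmapped = {}, []
    for name, value in sklearn_params.items():
        if name not in SKLEARN_TO_MLLIB:
            unmapped.append(name)
            continue
        if name == "loss":
            value = LOSS_NAMES[value]
        mapped[SKLEARN_TO_MLLIB[name]] = value
    return mapped, sorted(unmapped)


sklearn_params = {
    "learning_rate": 0.001, "loss": "deviance", "max_depth": 8,
    "min_samples_leaf": 1, "min_samples_split": 2,
    "n_estimators": 100, "subsample": 1.0,
}
mllib_kwargs, unmapped = to_mllib_kwargs(sklearn_params)
print(mllib_kwargs)
print(unmapped)

# With Spark available, the call would then look roughly like this
# (training_rdd is a placeholder for an RDD of LabeledPoint):
# from pyspark.mllib.tree import GradientBoostedTrees
# model = GradientBoostedTrees.trainClassifier(
#     training_rdd, categoricalFeaturesInfo={}, **mllib_kwargs)
```

As far as I can tell from the PySpark docs, trainClassifier defaults to learningRate=0.1 and
maxDepth=3, so leaving MLlib at its defaults would not match the learning_rate=0.001 and
max_depth=8 used above. MLlib also bins continuous features (maxBins), which has no direct
counterpart in the sklearn parameter list.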

Any pointers on how to track down the difference would be helpful.

