spark-user mailing list archives

From DB Tsai <>
Subject Re: Spark Implementation of XGBoost
Date Tue, 27 Oct 2015 08:02:34 GMT
Hi Meihua,

For categorical features, the ordinal issue can be solved by trying
all 2^(q-1) - 1 possible partitions of the q values into two groups.
However, that is computationally expensive. In Hastie's book, section
9.2.4, the trees can instead be trained by sorting the categories by
their mean residual and then learning as if they were ordered. It can
be proven that this gives the optimal solution. I have a proof that
this works for learning regression trees through variance reduction.
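
To make the shortcut above concrete, here is a minimal sketch in plain Python. It is not code from the package under discussion: the toy residuals in `groups` and the helper names `split_sse`, `best_split_exhaustive`, and `best_split_sorted` are made up for illustration. It assumes squared-error loss (variance reduction) and compares the exhaustive search over all 2^(q-1) - 1 partitions with the q - 1 splits obtained after sorting the categories by mean residual.

```python
from itertools import combinations

# Hypothetical toy data: residuals observed for each of q = 5 categories.
groups = {
    "a": [1.0, 1.2],
    "b": [3.1, 2.9, 3.0],
    "c": [0.2, 0.1],
    "d": [5.0, 4.8],
    "e": [2.0, 2.2, 1.9],
}


def split_sse(groups, left):
    """Squared error when categories in `left` form one child and the rest the other."""
    total = 0.0
    for side in (left, set(groups) - left):
        ys = [y for c in side for y in groups[c]]
        mean = sum(ys) / len(ys)
        total += sum((y - mean) ** 2 for y in ys)
    return total


def best_split_exhaustive(groups):
    """Try every partition of the categories into two non-empty groups."""
    cats = sorted(groups)
    first, rest = cats[0], cats[1:]
    candidates = []
    # Pin `first` to the left child so mirrored partitions are not counted twice.
    # k stops before len(rest), which would leave the right child empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            candidates.append({first, *extra})
    best = min(candidates, key=lambda left: split_sse(groups, left))
    return split_sse(groups, best), len(candidates)


def best_split_sorted(groups):
    """Sort categories by mean residual, then only try the q - 1 ordered cut points."""
    order = sorted(groups, key=lambda c: sum(groups[c]) / len(groups[c]))
    candidates = [set(order[:i]) for i in range(1, len(order))]
    best = min(candidates, key=lambda left: split_sse(groups, left))
    return split_sse(groups, best), len(candidates)


sse_all, n_all = best_split_exhaustive(groups)
sse_sorted, n_sorted = best_split_sorted(groups)
print(n_all, n_sorted)                   # number of candidate splits each search evaluates
print(abs(sse_all - sse_sorted) < 1e-9)  # both searches reach the same minimum error
```

With q = 5 the exhaustive search evaluates 15 candidate splits and the sorted search evaluates 4, yet both arrive at the same minimum squared error. The equivalence shown here relies on the squared-error criterion; other losses would need their own argument.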

I'm also interested in understanding how the L1 and L2 regularization
within the boosting works (and if it helps with overfitting more than



DB Tsai
PGP Key ID: 0xAF08DF8D

On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu <> wrote:
> Hi DB Tsai,
> Thank you very much for your interest and comment.
> 1) feature sub-sample is per-node, like random forest.
> 2) The current code heavily exploits the tree structure to speed up
> the learning (such as processing multiple learning nodes in one pass of
> the training data). So a generic GBM is likely to be a different
> codebase. Do you have a good reference for an efficient GBM? I am more
> than happy to look into that.
> 3) The algorithm accepts training data as a DataFrame with the
> featureCol indexed by VectorIndexer. You can specify which variables are
> categorical in the VectorIndexer. Please note that currently all
> categorical variables are treated as ordered. If you want some
> categorical variables to be treated as unordered, you can pass the data
> through OneHotEncoder before the VectorIndexer. I do have a plan to handle
> unordered categorical variables using the approach in RF in Spark ML
> (Please see roadmap in the
> Thanks,
> Meihua
> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai <> wrote:
>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>> you think you can implement a generic GBM and have it merged as part of
>> the Spark codebase?
>> Sincerely,
>> DB Tsai
>> ----------------------------------------------------------
>> PGP Key ID: 0xAF08DF8D
>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>> <> wrote:
>>> Hi Spark User/Dev,
>>> Inspired by the success of XGBoost, I have created a Spark package for
>>> gradient boosted trees with a 2nd-order approximation of arbitrary
>>> user-defined loss functions.
>>> Currently linear (normal) regression, binary classification, and Poisson
>>> regression are supported. You can extend it with other loss functions as
>>> well.
>>> L1 and L2 regularization, bagging, and feature sub-sampling are also
>>> employed to avoid overfitting.
>>> Thank you for testing. I am looking forward to your comments and
>>> suggestions. Bugs or improvements can be reported through GitHub.
>>> Many thanks!
>>> Meihua
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
