spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Meihua Wu <>
Subject Re: Spark Implementation of XGBoost
Date Wed, 28 Oct 2015 00:04:47 GMT
Hi DB Tsai,

Thank you again for your insightful comments!

1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing the benefits to many tree based
methods. How about I open a jira for this?

2) For L2/L1 regularization vs Learning rate (I use this name instead
shrinkage to avoid confusion), I have the following observations:

Suppose G and H are the sum (over the data assigned to a leaf node) of
the 1st and 2nd derivative of the loss evaluated at f_m, respectively.
Then for this leaf node,

* With a learning rate eta, f_{m+1} = f_m - G/H*eta

* With a L2 regularization coefficient lambda, f_{m+1} =f_m - G/(H+lambda)

If H>0 (convex loss), both approach lead to "shrinkage":

* For the learning rate approach, the percentage of shrinkage is
uniform for any leaf node.

* For L2 regularization, the percentage of shrinkage would adapt to
the number of instances assigned to a leaf node: more instances =>
larger G and H => less shrinkage. This behavior is intuitive to me. If
the value estimated from this node is based on a large amount of data,
the value should be reliable and less shrinkage is needed.

I suppose we could have something similar for L1.

I am not aware of theoretical results to conclude which method is
better. Likely to be dependent on the data at hand. Implementing
learning rate is on my radar for version 0.2. I should be able to add
it in a week or so. I will send you a note once it is done.



On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai <> wrote:
> Hi Meihua,
> For categorical features, the ordinal issue can be solved by trying
> all kind of different partitions 2^(q-1) -1 for q values into two
> groups. However, it's computational expensive. In Hastie's book, in
> 9.2.4, the trees can be trained by sorting the residuals and being
> learnt as if they are ordered. It can be proven that it will give the
> optimal solution. I have a proof that this works for learning
> regression trees through variance reduction.
> I'm also interested in understanding how the L1 and L2 regularization
> within the boosting works (and if it helps with overfitting more than
> shrinkage).
> Thanks.
> Sincerely,
> DB Tsai
> ----------------------------------------------------------
> Web:
> PGP Key ID: 0xAF08DF8D
> On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu <> wrote:
>> Hi DB Tsai,
>> Thank you very much for your interest and comment.
>> 1) feature sub-sample is per-node, like random forest.
>> 2) The current code heavily exploits the tree structure to speed up
>> the learning (such as processing multiple learning node in one pass of
>> the training data). So a generic GBM is likely to be a different
>> codebase. Do you have any nice reference of efficient GBM? I am more
>> than happy to look into that.
>> 3) The algorithm accept training data as a DataFrame with the
>> featureCol indexed by VectorIndexer. You can specify which variable is
>> categorical in the VectorIndexer. Please note that currently all
>> categorical variables are treated as ordered. If you want some
>> categorical variables as unordered, you can pass the data through
>> OneHotEncoder before the VectorIndexer. I do have a plan to handle
>> unordered categorical variable using the approach in RF in Spark ML
>> (Please see roadmap in the
>> Thanks,
>> Meihua
>> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai <> wrote:
>>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>>> you think you can implement generic GBM and have it merged as part of
>>> Spark codebase?
>>> Sincerely,
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Web:
>>> PGP Key ID: 0xAF08DF8D
>>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>>> <> wrote:
>>>> Hi Spark User/Dev,
>>>> Inspired by the success of XGBoost, I have created a Spark package for
>>>> gradient boosting tree with 2nd order approximation of arbitrary
>>>> user-defined loss functions.
>>>> Currently linear (normal) regression, binary classification, Poisson
>>>> regression are supported. You can extend with other loss function as
>>>> well.
>>>> L1, L2, bagging, feature sub-sampling are also employed to avoid overfitting.
>>>> Thank you for testing. I am looking forward to your comments and
>>>> suggestions. Bugs or improvements can be reported through GitHub.
>>>> Many thanks!
>>>> Meihua
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message