Subject: Re: Spark Implementation of XGBoost
From: Meihua Wu
To: DB Tsai
Cc: dev, user
Date: Tue, 27 Oct 2015 17:04:47 -0700
Content-Type: text/plain; charset=UTF-8

Hi DB Tsai,

Thank you again for your insightful comments!

1) I agree that the sorting method you suggested is a very efficient way to
handle unordered categorical variables in binary classification and
regression. I propose we add a Spark ML Transformer to do the sorting and
encoding, bringing the benefit to many tree-based methods; a rough sketch of
the idea is below. How about I open a JIRA for this?
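To make it concrete, here is a minimal sketch of the core logic such a
Transformer could use (hypothetical: the helper name and column names are
made up, and this is not part of SparkXGBoost's current API). It orders the
categories of a string column by their mean label and replaces each category
with its rank, so a downstream tree learner can treat the feature as ordered:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Order categories by mean label, then encode each category as its rank.
// For binary classification and regression, this ordering lets a tree find
// the optimal binary partition with a single ordered scan (Hastie, 9.2.4).
// Assumes the number of distinct categories is small enough to collect.
def sortEncode(df: DataFrame, catCol: String, labelCol: String): DataFrame = {
  val ranks: Map[String, Double] = df
    .groupBy(catCol)
    .agg(avg(labelCol).as("meanLabel"))
    .orderBy("meanLabel")
    .collect()
    .zipWithIndex
    .map { case (row, rank) => row.getString(0) -> rank.toDouble }
    .toMap
  val encode = udf((category: String) => ranks(category))
  df.withColumn(catCol + "Ordered", encode(col(catCol)))
}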
2) On L2/L1 regularization vs. learning rate (I use this name instead of
"shrinkage" to avoid confusion), I have the following observations.

Suppose G and H are the sums (over the data assigned to a leaf node) of the
1st and 2nd derivatives of the loss evaluated at f_m, respectively. Then for
this leaf node:

* With a learning rate eta: f_{m+1} = f_m - (G/H) * eta
* With an L2 regularization coefficient lambda: f_{m+1} = f_m - G/(H + lambda)

If H > 0 (convex loss), both approaches lead to "shrinkage":

* With the learning rate, the percentage of shrinkage is the same, 1 - eta,
  for every leaf node.
* With L2 regularization, the percentage of shrinkage, lambda/(H + lambda),
  adapts to the number of instances assigned to the leaf node: more
  instances => larger G and H => less shrinkage. For example, with
  lambda = 10, a leaf with H = 90 is shrunk by only 10%, while a leaf with
  H = 10 is shrunk by 50%.

This behavior is intuitive to me: if the value estimated for a node is based
on a large amount of data, the value should be reliable and less shrinkage
is needed. I suppose we could have something similar for L1. I am not aware
of theoretical results establishing which method is better; it likely
depends on the data at hand. (A tiny numeric sketch of the two updates is
appended at the end of this mail.)

Implementing the learning rate is on my radar for version 0.2. I should be
able to add it in a week or so, and I will send you a note once it is done.

Thanks,

Meihua

On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai wrote:
> Hi Meihua,
>
> For categorical features, the ordinal issue can be solved by trying all
> 2^(q-1) - 1 possible partitions of the q values into two groups. However,
> that is computationally expensive. In Hastie's book, section 9.2.4, the
> trees can be trained by sorting the residuals and learning as if the
> feature were ordered. It can be proven that this gives the optimal
> solution. I have a proof that this works for learning regression trees
> through variance reduction.
>
> I'm also interested in understanding how the L1 and L2 regularization
> within the boosting works (and whether it helps with overfitting more
> than shrinkage does).
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu wrote:
>> Hi DB Tsai,
>>
>> Thank you very much for your interest and comments.
>>
>> 1) Feature sub-sampling is per node, like random forest.
>>
>> 2) The current code heavily exploits the tree structure to speed up
>> learning (such as processing multiple learning nodes in one pass over
>> the training data), so a generic GBM would likely be a different
>> codebase. Do you have a good reference for an efficient generic GBM? I
>> am more than happy to look into it.
>>
>> 3) The algorithm accepts training data as a DataFrame with the
>> featureCol indexed by VectorIndexer. You can specify which variables
>> are categorical in the VectorIndexer. Please note that currently all
>> categorical variables are treated as ordered. If you want some
>> categorical variables treated as unordered, you can pass the data
>> through OneHotEncoder before the VectorIndexer. I do plan to handle
>> unordered categorical variables using the approach taken by RF in
>> Spark ML (please see the roadmap in the README.md).
>>
>> Thanks,
>>
>> Meihua
>>
>>
>>
>> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai wrote:
>>> Interesting. For feature sub-sampling, is it per node or per tree? Do
>>> you think you can implement a generic GBM and have it merged as part
>>> of the Spark codebase?
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>>
>>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu wrote:
>>>> Hi Spark User/Dev,
>>>>
>>>> Inspired by the success of XGBoost, I have created a Spark package
>>>> for gradient boosted trees with a 2nd-order approximation of
>>>> arbitrary user-defined loss functions:
>>>>
>>>> https://github.com/rotationsymmetry/SparkXGBoost
>>>>
>>>> Currently linear (normal) regression, binary classification, and
>>>> Poisson regression are supported, and you can extend it with other
>>>> loss functions as well.
>>>>
>>>> L1/L2 regularization, bagging, and feature sub-sampling are also
>>>> employed to avoid overfitting.
>>>>
>>>> Thank you for testing. I am looking forward to your comments and
>>>> suggestions. Bugs or improvement ideas can be reported through
>>>> GitHub.
>>>>
>>>> Many thanks!
>>>>
>>>> Meihua
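As promised above, a tiny numeric sketch of the two update rules from
point 2 (the leaf statistics, eta, and lambda are made-up numbers for
illustration; this is not SparkXGBoost code):

object ShrinkageDemo extends App {
  val eta = 0.1      // learning rate
  val lambda = 10.0  // L2 regularization coefficient

  // (name, G, H): sums of the 1st/2nd derivatives over a leaf's instances.
  // The large leaf sees roughly 9x the data, hence ~9x larger G and H.
  val leaves = Seq(("small leaf", 5.0, 10.0), ("large leaf", 45.0, 90.0))

  for ((name, g, h) <- leaves) {
    val byEta = -g / h * eta      // learning rate: uniform (1 - eta) shrinkage
    val byL2  = -g / (h + lambda) // L2: shrinkage lambda / (H + lambda)
    println(f"$name: eta step = $byEta%.3f, L2 step = $byL2%.3f, " +
            f"L2 shrinkage = ${100 * lambda / (h + lambda)}%.0f%%")
  }
  // Output shows the L2 update shrinks the small leaf's step by 50%
  // but the large leaf's step by only 10%, while eta shrinks both by 90%.
}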
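And for point 3 in the quoted reply, a minimal sketch of the suggested
preprocessing for an unordered categorical variable (the column names are
hypothetical; StringIndexer, OneHotEncoder, VectorAssembler, and
VectorIndexer are standard Spark ML stages):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder,
  VectorAssembler, VectorIndexer}

// "color" is an unordered categorical column; "x1", "x2" are numeric.
val indexer = new StringIndexer()
  .setInputCol("color").setOutputCol("colorIdx")
val encoder = new OneHotEncoder()
  .setInputCol("colorIdx").setOutputCol("colorVec")
val assembler = new VectorAssembler()
  .setInputCols(Array("colorVec", "x1", "x2"))
  .setOutputCol("rawFeatures")
// VectorIndexer marks low-cardinality dimensions as categorical for trees.
val vectorIndexer = new VectorIndexer()
  .setInputCol("rawFeatures").setOutputCol("features")
  .setMaxCategories(8)
val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler, vectorIndexer))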