Subject: Re: Spark Implementation of XGBoost
From: Meihua Wu
To: DB Tsai
Cc: dev, user
Date: Tue, 27 Oct 2015 17:04:47 -0700
Content-Type: text/plain; charset=UTF-8

Hi DB Tsai,

Thank you again for your insightful comments!

1) I agree that the sorting method you suggested is a very efficient way to
handle unordered categorical variables in binary classification and
regression. I propose we add a Spark ML Transformer to do the sorting and
encoding, bringing the benefit to many tree-based methods; a rough sketch of
the idea is below. How about I open a JIRA for this?
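To make it concrete, here is a minimal sketch of the core logic such a
Transformer could use (hypothetical: the helper name and column names are
made up, and this is not part of SparkXGBoost's current API). It orders the
categories of a string column by their mean label and replaces each category
with its rank, so a downstream tree learner can treat the feature as ordered:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Order categories by mean label, then encode each category as its rank.
// For binary classification and regression, this ordering lets a tree find
// the optimal binary partition with a single ordered scan (Hastie, 9.2.4).
// Assumes the number of distinct categories is small enough to collect.
def sortEncode(df: DataFrame, catCol: String, labelCol: String): DataFrame = {
  val ranks: Map[String, Double] = df
    .groupBy(catCol)
    .agg(avg(labelCol).as("meanLabel"))
    .orderBy("meanLabel")
    .collect()
    .zipWithIndex
    .map { case (row, rank) => row.getString(0) -> rank.toDouble }
    .toMap
  val encode = udf((category: String) => ranks(category))
  df.withColumn(catCol + "Ordered", encode(col(catCol)))
}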
2) On L2/L1 regularization vs. learning rate (I use this name instead of
"shrinkage" to avoid confusion), I have the following observations.

Suppose G and H are the sums (over the data assigned to a leaf node) of the
1st and 2nd derivatives of the loss evaluated at f_m, respectively. Then for
this leaf node:

* With a learning rate eta: f_{m+1} = f_m - (G/H) * eta
* With an L2 regularization coefficient lambda: f_{m+1} = f_m - G/(H + lambda)

If H > 0 (convex loss), both approaches lead to "shrinkage":

* With the learning rate, the percentage of shrinkage is the same, 1 - eta,
  for every leaf node.
* With L2 regularization, the percentage of shrinkage, lambda/(H + lambda),
  adapts to the number of instances assigned to the leaf node: more
  instances => larger G and H => less shrinkage. For example, with
  lambda = 10, a leaf with H = 90 is shrunk by only 10%, while a leaf with
  H = 10 is shrunk by 50%.

This behavior is intuitive to me: if the value estimated for a node is based
on a large amount of data, the value should be reliable and less shrinkage
is needed. I suppose we could have something similar for L1. I am not aware
of theoretical results establishing which method is better; it likely
depends on the data at hand. (A tiny numeric sketch of the two updates is
appended at the end of this mail.)

Implementing the learning rate is on my radar for version 0.2. I should be
able to add it in a week or so, and I will send you a note once it is done.

Thanks,

Meihua

On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai wrote:
> Hi Meihua,
>
> For categorical features, the ordinal issue can be solved by trying all
> 2^(q-1) - 1 possible partitions of the q values into two groups. However,
> that is computationally expensive. In Hastie's book, section 9.2.4, the
> trees can be trained by sorting the residuals and learning as if the
> feature were ordered. It can be proven that this gives the optimal
> solution. I have a proof that this works for learning regression trees
> through variance reduction.
>
> I'm also interested in understanding how the L1 and L2 regularization
> within the boosting works (and whether it helps with overfitting more
> than shrinkage does).
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu wrote:
>> Hi DB Tsai,
>>
>> Thank you very much for your interest and comments.
>>
>> 1) Feature sub-sampling is per node, like random forest.
>>
>> 2) The current code heavily exploits the tree structure to speed up
>> learning (such as processing multiple learning nodes in one pass over
>> the training data), so a generic GBM would likely be a different
>> codebase. Do you have a good reference for an efficient generic GBM? I
>> am more than happy to look into it.
>>
>> 3) The algorithm accepts training data as a DataFrame with the
>> featureCol indexed by VectorIndexer. You can specify which variables
>> are categorical in the VectorIndexer. Please note that currently all
>> categorical variables are treated as ordered. If you want some
>> categorical variables treated as unordered, you can pass the data
>> through OneHotEncoder before the VectorIndexer. I do plan to handle
>> unordered categorical variables using the approach taken by RF in
>> Spark ML (please see the roadmap in the README.md).
>>
>> Thanks,
>>
>> Meihua
>>
>>
>>
>> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai wrote:
>>> Interesting. For feature sub-sampling, is it per node or per tree? Do
>>> you think you can implement a generic GBM and have it merged as part
>>> of the Spark codebase?
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>>
>>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu wrote:
>>>> Hi Spark User/Dev,
>>>>
>>>> Inspired by the success of XGBoost, I have created a Spark package
>>>> for gradient boosted trees with a 2nd-order approximation of
>>>> arbitrary user-defined loss functions:
>>>>
>>>> https://github.com/rotationsymmetry/SparkXGBoost
>>>>
>>>> Currently linear (normal) regression, binary classification, and
>>>> Poisson regression are supported, and you can extend it with other
>>>> loss functions as well.
>>>>
>>>> L1/L2 regularization, bagging, and feature sub-sampling are also
>>>> employed to avoid overfitting.
>>>>
>>>> Thank you for testing. I am looking forward to your comments and
>>>> suggestions. Bugs or improvement ideas can be reported through
>>>> GitHub.
>>>>
>>>> Many thanks!
>>>>
>>>> Meihua
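As promised above, a tiny numeric sketch of the two update rules from
point 2 (the leaf statistics, eta, and lambda are made-up numbers for
illustration; this is not SparkXGBoost code):

object ShrinkageDemo extends App {
  val eta = 0.1      // learning rate
  val lambda = 10.0  // L2 regularization coefficient

  // (name, G, H): sums of the 1st/2nd derivatives over a leaf's instances.
  // The large leaf sees roughly 9x the data, hence ~9x larger G and H.
  val leaves = Seq(("small leaf", 5.0, 10.0), ("large leaf", 45.0, 90.0))

  for ((name, g, h) <- leaves) {
    val byEta = -g / h * eta      // learning rate: uniform (1 - eta) shrinkage
    val byL2  = -g / (h + lambda) // L2: shrinkage lambda / (H + lambda)
    println(f"$name: eta step = $byEta%.3f, L2 step = $byL2%.3f, " +
            f"L2 shrinkage = ${100 * lambda / (h + lambda)}%.0f%%")
  }
  // Output shows the L2 update shrinks the small leaf's step by 50%
  // but the large leaf's step by only 10%, while eta shrinks both by 90%.
}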
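And for point 3 in the quoted reply, a minimal sketch of the suggested
preprocessing for an unordered categorical variable (the column names are
hypothetical; StringIndexer, OneHotEncoder, VectorAssembler, and
VectorIndexer are standard Spark ML stages):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder,
  VectorAssembler, VectorIndexer}

// "color" is an unordered categorical column; "x1", "x2" are numeric.
val indexer = new StringIndexer()
  .setInputCol("color").setOutputCol("colorIdx")
val encoder = new OneHotEncoder()
  .setInputCol("colorIdx").setOutputCol("colorVec")
val assembler = new VectorAssembler()
  .setInputCols(Array("colorVec", "x1", "x2"))
  .setOutputCol("rawFeatures")
// VectorIndexer marks low-cardinality dimensions as categorical for trees.
val vectorIndexer = new VectorIndexer()
  .setInputCol("rawFeatures").setOutputCol("features")
  .setMaxCategories(8)
val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler, vectorIndexer))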