Subject: Re: SGD didn't work well with high dimension by a random generated data test.
From: Ted Dunning
Date: Sun, 29 May 2011 16:04:46 -0700
To: user@mahout.apache.org

I have done some unrelated tests and I think that SGD has suffered from some unknown decrease in accuracy. 20 newsgroups used to get to 86% accuracy and now only gets to near 80%. When I find time I will try to figure out what has happened. Your test results may be related.

On Mon, May 23, 2011 at 4:22 AM, Stanley Xu wrote:

> It looks like if I set decay to 1 (no learning-rate decay), remove the
> regularization, use the raw OnlineLogisticRegression, and adjust the
> learning rate, the performance is much better.
>
> Best wishes,
> Stanley Xu
>
> On Mon, May 23, 2011 at 4:18 PM, Stanley Xu wrote:
>
> > Dear All,
> >
> > I am trying to evaluate the correctness of the SGD algorithm in Mahout.
> > I use a program to generate random weights, training data, and test
> > data, and use OnlineLogisticRegression and AdaptiveLogisticRegression
> > to train and classify the result. But it looks like SGD didn't work
> > well. Am I missing anything in using the SGD algorithm?
> >
> > I did the test with the following data sets:
> >
> > 1. 10 feature dimensions; each feature value is 0 or 1.
> > Weights are generated randomly in the range -5 to 5. The training data
> > set is 10k records or 100 records, with a 1:1 ratio of negative to
> > positive targets.
> > The classification on both the training data and the test data looks
> > fine to me. Both the false positives and false negatives are fewer than
> > 100, which is less than 1%.
> >
> > 2. 100 feature dimensions; each feature value is 0 or 1. Weights are
> > generated randomly in the range -5 to 5. The training data set is 100k
> > to 1000k records, with a 1:1 ratio of negative to positive targets.
> > The classification on both the training data and the test data is not
> > very good. The false positive and false negative rates are both close
> > to 10%. But the AUC is pretty good: about 90% with
> > AdaptiveLogisticRegression, 85% with raw OnlineLogisticRegression.
> >
> > 3. 100 feature dimensions, but with the negative-to-positive ratio
> > changed to 10:1 to match the real training set we will get.
> > With raw OnlineLogisticRegression, most positive targets (more than
> > 90%) are predicted as negative, and the AUC decreases to 60%. Even
> > worse, with AdaptiveLogisticRegression, all positive targets are
> > predicted as negative, and the AUC decreases to 58%.
> >
> > The code to generate the data could be found here:
> > http://pastebin.com/GAA1di5z
> >
> > The code to train and classify the data could be found here:
> > http://pastebin.com/EjMpGQ1h
> >
> > The parameters there can be changed to generate different data sets.
> >
> > I think the error rate is unacceptably high, especially for data that a
> > hyperplane can separate perfectly. And the error rate is unusually high
> > even on the training data set.
> >
> > I know SGD is an approximate solution rather than an exact one, but
> > isn't a 20% classification error too high?
> > I understand that for an unbalanced ratio of positives to negatives in
> > the training set, we could add a weight to the training examples. I
> > have tried that, but it is hard to decide which weight to choose, and
> > in my understanding, the weight should also change dynamically with the
> > current learning rate, since a high learning rate combined with a high
> > weight will mislead the model in an incorrect direction. We have tried
> > some strategies, but the results were not good. Any tips on how to set
> > the weight for SGD, given that it is not a global convex optimization
> > solution compared to other Logistic Regression algorithms?
> >
> > Thanks.
> >
> > Best wishes,
> > Stanley Xu
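For readers following the thread without the pastebin links, here is a minimal, hypothetical re-creation of the separable-data setup from case 2 in plain Python. It is a sketch under stated assumptions, not the original test code and not Mahout's OnlineLogisticRegression: 0/1 features, true weights drawn uniformly from [-5, 5], labels given by the sign of the true score (so a perfect separating hyperplane exists), and a plain logistic-SGD loop with a constant learning rate and no regularization, mirroring the "decay = 1, no regularization" setting the thread reports as working better.

```python
# Illustrative sketch of the separable-data SGD test (an assumption about the
# setup, not the pastebin code and not Mahout's OnlineLogisticRegression).
import math
import random

random.seed(42)
DIM = 100          # feature dimensions, as in case 2
N_TRAIN = 30_000
N_TEST = 1_000

# True weights uniform in [-5, 5]; label is the sign of the true score,
# so the data is perfectly linearly separable.
w_true = [random.uniform(-5.0, 5.0) for _ in range(DIM)]

def sample():
    x = [random.randint(0, 1) for _ in range(DIM)]
    y = 1 if sum(wt * xi for wt, xi in zip(w_true, x)) > 0 else 0
    return x, y

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Plain logistic SGD: constant learning rate (no decay), no regularization.
w = [0.0] * DIM
rate = 0.5
for _ in range(N_TRAIN):
    x, y = sample()
    z = max(-30.0, min(30.0, score(w, x)))   # clip to avoid overflow in exp
    p = 1.0 / (1.0 + math.exp(-z))           # predicted P(y = 1)
    for i in range(DIM):
        if x[i]:
            w[i] += rate * (y - p)           # gradient step on active features

errors = sum(1 for x, y in (sample() for _ in range(N_TEST))
             if (1 if score(w, x) > 0 else 0) != y)
error_rate = errors / N_TEST
print(f"held-out error rate: {error_rate:.3f}")
```

On this idealized setup, a constant-rate logistic SGD learns a direction correlated with the true weights; residual errors concentrate on examples near the separating hyperplane, which matches the observation that accuracy can lag well behind AUC.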
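On the weighting question at the end of the message: one common approach, sketched here as an assumption and not as Mahout's API, is cost-sensitive reweighting, where the gradient of each positive example is scaled by the negative-to-positive ratio so both classes contribute equally to the expected gradient. Since the weight multiplies the effective step size, the product of learning rate and class weight is what must stay in a stable range; keeping that product fixed is one way the weight "changes dynamically with the learning rate", as the message suggests.

```python
# Illustrative cost-sensitive SGD update for logistic regression (a sketch of
# one common technique, not Mahout's API).
import math

def weighted_sgd_step(w, x, y, rate, pos_weight):
    """One logistic-SGD update; the gradient of a positive example (y == 1)
    is scaled by pos_weight to counter class imbalance."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))                   # predicted P(y = 1)
    g = (y - p) * (pos_weight if y == 1 else 1.0)    # weighted gradient term
    return [wi + rate * g * xi for wi, xi in zip(w, x)]

# With a 10:1 negative:positive mix, pos_weight = 10 balances the classes'
# expected gradients; keep rate * pos_weight small enough that a single rare
# positive example cannot swing the weights wildly.
w = weighted_sgd_step([0.0, 0.0], [1, 0], 1, rate=0.1, pos_weight=10.0)
print(w)   # the active coordinate moves up, the inactive one stays at zero
```

The effective step for a positive example here is rate * pos_weight * (y - p), which is why a large learning rate combined with a large class weight destabilizes training, exactly the failure mode described above.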