Date: Mon, 23 May 2011 16:18:29 +0800
Subject: SGD didn't work well with high dimensions in a randomly generated data test
From: Stanley Xu <wenhao.xu@gmail.com>
To: user@mahout.apache.org

Dear All,

I am trying to evaluate the correctness of the SGD algorithm in Mahout. I wrote a program that generates random weights, training data, and test data, and then used OnlineLogisticRegression and AdaptiveLogisticRegression to train and classify. It looks as though SGD didn't work well, and I am wondering whether I missed something in how I used it.

I ran the test with the following data sets:

1. 10 feature dimensions, each value 0 or 1. The weights are generated randomly in the range -5 to 5. The training set is 10k records (or 100 records), with a 1:1 ratio of negative to positive targets. Classification on both the training and test data looks fine to me: false positives and false negatives are each under 100, i.e. less than 1%.

2. 100 feature dimensions, each value 0 or 1. The weights are again generated randomly in the range -5 to 5. The training set is 100k to 1000k records, with a 1:1 negative-to-positive ratio. Classification on both the training and test data is not very good: false positives and false negatives are each close to 10%.
But the AUC is pretty good: about 90% with AdaptiveLogisticRegression and 85% with raw OnlineLogisticRegression.

3. 100 feature dimensions, but with the negative-to-positive ratio changed to 10:1 to match the real training set we will get. With raw OnlineLogisticRegression, most positive targets (more than 90%) are predicted as negative, and the AUC drops to 60%. Even worse, with AdaptiveLogisticRegression, all positive targets are predicted as negative, and the AUC drops to 58%.

The code to generate the data is here: http://pastebin.com/GAA1di5z
The code to train and classify the data is here: http://pastebin.com/EjMpGQ1h
The parameters there can be changed to generate different data sets.

I think the error rate is unacceptably high, especially on data that a perfect hyperplane could separate, and it is unusually high even on the training set itself. I know SGD is an approximate solution rather than an exact one, but isn't a 20% classification error too high?

I understand that for an unbalanced ratio of positives to negatives in the training set we could attach a weight to the training examples. I have tried that, but it is hard to decide which weight to choose, and as I understand it, the weight should also change dynamically with the current learning rate, since a high learning rate combined with a high weight will mislead the model in an incorrect direction. We have tried some strategies, but the results are not good. Any tips on how to set the weight for SGD, given that it is not a globally convex optimization procedure compared with other logistic regression algorithms?

Thanks.

Best wishes,
Stanley Xu
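P.S. For anyone who wants to reproduce the setup without Mahout, here is a minimal pure-Python sketch of the experiment described above: binary 0/1 features, true weights drawn uniformly from [-5, 5], deterministic (hence perfectly separable) labels with roughly the 10:1 imbalance from case 3, and plain SGD on the logistic loss with an optional per-class weight. This is an illustrative assumption of my setup, not the pastebin code and not Mahout's API; names like pos_weight are hypothetical.

```python
import math
import random

random.seed(42)

D = 100   # feature dimensions, as in cases 2 and 3 above
N = 2000  # training examples (kept small for illustration)

# True weights drawn uniformly from [-5, 5], as described above.
true_w = [random.uniform(-5, 5) for _ in range(D)]

def raw_score(x):
    return sum(wi * xi for wi, xi in zip(true_w, x))

# Put the label threshold at the 90th percentile of scores so roughly
# 1 in 10 examples is positive (the 10:1 ratio of case 3). Labels are
# deterministic, so the data is perfectly linearly separable.
xs = [[random.randint(0, 1) for _ in range(D)] for _ in range(N)]
tau = sorted(raw_score(x) for x in xs)[int(0.9 * N)]
data = [(x, 1 if raw_score(x) > tau else 0) for x in xs]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=10, lr=0.01, pos_weight=1.0):
    """Plain SGD on the logistic loss; pos_weight scales the gradient
    of positive examples to counter class imbalance (a hypothetical
    knob, not a Mahout parameter)."""
    w = [0.0] * D
    b = 0.0
    for _ in range(epochs):
        random.shuffle(examples)
        for x, y in examples:
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            g = (p - y) * (pos_weight if y == 1 else 1.0)
            b -= lr * g
            for i, xi in enumerate(x):
                if xi:
                    w[i] -= lr * g
    return w, b

def rates(w, b, examples):
    """Return (false-negative rate on positives, overall error rate)."""
    fn = pos = wrong = 0
    for x, y in examples:
        pred = 1 if sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) > 0.5 else 0
        pos += y
        fn += 1 if (y == 1 and pred == 0) else 0
        wrong += pred != y
    return fn / max(pos, 1), wrong / len(examples)

w1, b1 = train(list(data), pos_weight=1.0)
w10, b10 = train(list(data), pos_weight=10.0)
print("unweighted    (FN rate, error):", rates(w1, b1, data))
print("pos_weight=10 (FN rate, error):", rates(w10, b10, data))
```

Note that scaling the gradient by pos_weight is equivalent to replicating each positive example pos_weight times, which is exactly why a large weight combined with a large learning rate can overshoot and mislead the model, as mentioned above.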