Subject: Re: SGD didn't work well with high dimension by a random generated data test.
From: Ted Dunning
Date: Sun, 29 May 2011 16:04:46 -0700
To: user@mahout.apache.org

I have done some unrelated tests and I think that SGD has suffered from some unknown decrease in accuracy. 20 newsgroups used to get to 86% accuracy and now only gets to near 80%. When I find time I will try to figure out what has happened. Your test results may be related.

On Mon, May 23, 2011 at 4:22 AM, Stanley Xu wrote:

> It looks like if I set decay to 1 (no learning-rate decay), remove the
> regularization, use the raw OnlineLogisticRegression, and adjust the
> learning rate, the performance is much better.
>
> Best wishes,
> Stanley Xu
>
> On Mon, May 23, 2011 at 4:18 PM, Stanley Xu wrote:
>
> > Dear All,
> >
> > I am trying to evaluate the correctness of the SGD algorithm in Mahout.
> > I use a program to generate random weights, training data, and test
> > data, and use OnlineLogisticRegression and AdaptiveLogisticRegression
> > to train and classify the result. But it looks like SGD didn't work
> > well. Am I missing anything in using the SGD algorithm?
> >
> > I did the test with the following data sets:
> >
> > 1. 10 feature dimensions; each feature value is 0 or 1.
> > Weights are generated randomly in the range -5 to 5. The training data
> > set is 10k records or 100 records, with a 1:1 ratio of negative to
> > positive targets.
> > The classification on both the training data and the test data looks
> > fine to me. Both the false positives and false negatives are fewer than
> > 100, which is less than 1%.
> >
> > 2. 100 feature dimensions; each feature value is 0 or 1. Weights are
> > generated randomly in the range -5 to 5. The training data set is 100k
> > to 1000k records, with a 1:1 ratio of negative to positive targets.
> > The classification on both the training data and the test data is not
> > very good. The false positive and false negative rates are both close
> > to 10%. But the AUC is pretty good: about 90% with
> > AdaptiveLogisticRegression, 85% with raw OnlineLogisticRegression.
> >
> > 3. 100 feature dimensions, but with the negative-to-positive ratio
> > changed to 10:1 to match the real training set we will get.
> > With raw OnlineLogisticRegression, most positive targets (more than
> > 90%) are predicted as negative, and the AUC decreases to 60%. Even
> > worse, with AdaptiveLogisticRegression, all positive targets are
> > predicted as negative, and the AUC decreases to 58%.
> >
> > The code to generate the data could be found here:
> > http://pastebin.com/GAA1di5z
> >
> > The code to train and classify the data could be found here:
> > http://pastebin.com/EjMpGQ1h
> >
> > The parameters there can be changed to generate different data sets.
> >
> > I think the error rate is unacceptably high, especially for data that a
> > hyperplane can separate perfectly. And the error rate is unusually high
> > even on the training data set.
> >
> > I know SGD is an approximate solution rather than an exact one, but
> > isn't a 20% classification error too high?
> > I understand that for an unbalanced ratio of positives to negatives in
> > the training set, we could add a weight to the training examples. I
> > have tried that, but it is hard to decide which weight to choose, and
> > in my understanding, the weight should also change dynamically with the
> > current learning rate, since a high learning rate combined with a high
> > weight will mislead the model in an incorrect direction. We have tried
> > some strategies, but the results were not good. Any tips on how to set
> > the weight for SGD, given that it is not a global convex optimization
> > solution compared to other Logistic Regression algorithms?
> >
> > Thanks.
> >
> > Best wishes,
> > Stanley Xu
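For readers following the thread without the pastebin links, here is a minimal, hypothetical re-creation of the separable-data setup from case 2 in plain Python. It is a sketch under stated assumptions, not the original test code and not Mahout's OnlineLogisticRegression: 0/1 features, true weights drawn uniformly from [-5, 5], labels given by the sign of the true score (so a perfect separating hyperplane exists), and a plain logistic-SGD loop with a constant learning rate and no regularization, mirroring the "decay = 1, no regularization" setting the thread reports as working better.

```python
# Illustrative sketch of the separable-data SGD test (an assumption about the
# setup, not the pastebin code and not Mahout's OnlineLogisticRegression).
import math
import random

random.seed(42)
DIM = 100          # feature dimensions, as in case 2
N_TRAIN = 30_000
N_TEST = 1_000

# True weights uniform in [-5, 5]; label is the sign of the true score,
# so the data is perfectly linearly separable.
w_true = [random.uniform(-5.0, 5.0) for _ in range(DIM)]

def sample():
    x = [random.randint(0, 1) for _ in range(DIM)]
    y = 1 if sum(wt * xi for wt, xi in zip(w_true, x)) > 0 else 0
    return x, y

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Plain logistic SGD: constant learning rate (no decay), no regularization.
w = [0.0] * DIM
rate = 0.5
for _ in range(N_TRAIN):
    x, y = sample()
    z = max(-30.0, min(30.0, score(w, x)))   # clip to avoid overflow in exp
    p = 1.0 / (1.0 + math.exp(-z))           # predicted P(y = 1)
    for i in range(DIM):
        if x[i]:
            w[i] += rate * (y - p)           # gradient step on active features

errors = sum(1 for x, y in (sample() for _ in range(N_TEST))
             if (1 if score(w, x) > 0 else 0) != y)
error_rate = errors / N_TEST
print(f"held-out error rate: {error_rate:.3f}")
```

On this idealized setup, a constant-rate logistic SGD learns a direction correlated with the true weights; residual errors concentrate on examples near the separating hyperplane, which matches the observation that accuracy can lag well behind AUC.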
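On the weighting question at the end of the message: one common approach, sketched here as an assumption and not as Mahout's API, is cost-sensitive reweighting, where the gradient of each positive example is scaled by the negative-to-positive ratio so both classes contribute equally to the expected gradient. Since the weight multiplies the effective step size, the product of learning rate and class weight is what must stay in a stable range; keeping that product fixed is one way the weight "changes dynamically with the learning rate", as the message suggests.

```python
# Illustrative cost-sensitive SGD update for logistic regression (a sketch of
# one common technique, not Mahout's API).
import math

def weighted_sgd_step(w, x, y, rate, pos_weight):
    """One logistic-SGD update; the gradient of a positive example (y == 1)
    is scaled by pos_weight to counter class imbalance."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))                   # predicted P(y = 1)
    g = (y - p) * (pos_weight if y == 1 else 1.0)    # weighted gradient term
    return [wi + rate * g * xi for wi, xi in zip(w, x)]

# With a 10:1 negative:positive mix, pos_weight = 10 balances the classes'
# expected gradients; keep rate * pos_weight small enough that a single rare
# positive example cannot swing the weights wildly.
w = weighted_sgd_step([0.0, 0.0], [1, 0], 1, rate=0.1, pos_weight=10.0)
print(w)   # the active coordinate moves up, the inactive one stays at zero
```

The effective step for a positive example here is rate * pos_weight * (y - p), which is why a large learning rate combined with a large class weight destabilizes training, exactly the failure mode described above.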