Subject: Re: Which exact algorithm is used in the Mahout SGD?
From: Dmitriy Lyubimov
To: user@mahout.apache.org
Date: Tue, 31 May 2011 19:19:42 -0700

Interesting. I'd probably be interested to try it out.

On Thu, Apr 28, 2011 at 11:31 PM, Stanley Xu wrote:
> Thanks Ted and Lance. And sorry for the jargon.
>
> For the delay Ted mentioned, we have already considered that; still thanks a
> lot for all the detailed ideas, they were pretty helpful.
> For parallelized SGD, I just found a new paper about using DSGD in matrix
> factorization. It is different from logistic regression, but it might be
> helpful as well. The title is "Large-Scale Matrix Factorization with
> Distributed Stochastic Gradient Descent", if anyone is interested.
>
> Best wishes,
> Stanley Xu
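For concreteness, here is a minimal sketch of training Mahout's SGD classifier (OnlineLogisticRegression), assuming the 0.5-era API; the feature indices and hyperparameters are illustrative only, not taken from this thread.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class CtrSgdSketch {
  public static void main(String[] args) {
    int numFeatures = 1000;  // length of the hashed feature vector

    // Two categories (click / no-click), L1 prior for sparsity.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, numFeatures, new L1())
            .learningRate(0.1)   // illustrative hyperparameters
            .lambda(1.0e-6);

    // One made-up training example: only non-zero features are set,
    // which is why training cost tracks the non-zero values rather
    // than the full vector length.
    Vector x = new RandomAccessSparseVector(numFeatures);
    x.setQuick(17, 1.0);
    x.setQuick(523, 1.0);

    learner.train(1, x);                        // 1 = clicked
    double pClick = learner.classifyScalar(x);  // P(click) for a 2-class model
    System.out.println("p(click) = " + pClick);
  }
}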
> On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning wrote:
>
>> Yes.
>>
>> Apologies for jargon and TLAs
>> <http://en.wikipedia.org/wiki/Three-letter_acronym>.
>>
>> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog wrote:
>>
>> > CTR == Clickthrough Rate
>> >
>> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning wrote:
>> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu wrote:
>> > >
>> > >> ... As I understood the algorithm, the training time relies only on
>> > >> the non-zero records, but per our test there is some overhead we
>> > >> could not ignore for those non-zero records, though the cost is
>> > >> sub-linear or logarithmic in the length of the hashed vector.
>> > >
>> > > This is pretty close if we say "non-zero values". A record usually
>> > > refers to an entire training example.
>> > >
>> > > The extra work refers mostly to deferred regularization that
>> > > eventually has to be applied. My guess is that it is even less than
>> > > log in the feature vector size.
>> > >
>> > >> And in CTR prediction, I am not quite sure it will converge very
>> > >> quickly.
>> > >
>> > > I was saying this purely based on the number of features.
>> > >
>> > >> Because we will very possibly see some records that have almost the
>> > >> same features but different results in display ads.
>> > >
>> > > The algorithm can still converge to an estimate of the probability
>> > > here.
>> > >
>> > >> But we will see the result in the future.
>> > >
>> > > You have to be *very* careful about this to avoid prejudicing the
>> > > model against recent impressions. If you have fast feedback into the
>> > > ad targeting system, you can get severe instability.
>> > >
>> > > The key thing you have to do to avoid these biases is to define a
>> > > maximum delay before click for the purposes of modeling. You need to
>> > > ignore all impressions younger than this delay (because they may
>> > > still get a click) and you need to ignore all clicks after this delay
>> > > (to avoid bias in favor of old impressions). For on-line ads you can
>> > > probably use a maximum delay of a few minutes because most clicks
>> > > will happen by then.
>> > >
>> > > To find a good value for the maximum delay, plot the CTR for a bunch
>> > > of ads versus delay. This will increase rapidly shortly after zero
>> > > delay, but then will level off. The ordering of ads by CTR is what
>> > > you care about, so you can follow the curves back and find the
>> > > shortest delay where the ordering is clearly preserved. Use that as
>> > > your maximum delay. Typically this is roughly where your CTR is at
>> > > about 80-90% of the final value.
>> > >
>> > >> (We are still working on creating a framework to dig all the
>> > >> features we need out of the logs; I would like to share our
>> > >> experience with Mahout SGD once we get our CTR prediction model
>> > >> released.)
>> > >>
>> > >> And for parallelized SGD, what do you mean by "help with sparse
>> > >> inputs that exhibit long-tail frequency distribution"? Would you
>> > >> like to share some of your ideas, Ted?
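A small sketch of the maximum-delay rule Ted describes above: impressions younger than the cutoff are skipped because a click may still arrive, and clicks later than the cutoff count as non-clicks. The Impression record and its field names are hypothetical, not part of Mahout.

import java.util.ArrayList;
import java.util.List;

public class DelayWindowLabeler {
  static class Impression {
    long shownAtMs;        // when the ad was shown
    Long clickedAtMs;      // click time, or null if no click seen yet
  }

  static class Labeled {
    Impression impression;
    int label;             // 1 = clicked within the window, 0 = did not
    Labeled(Impression i, int l) { impression = i; label = l; }
  }

  static List<Labeled> label(List<Impression> log, long nowMs, long maxDelayMs) {
    List<Labeled> out = new ArrayList<Labeled>();
    for (Impression imp : log) {
      if (nowMs - imp.shownAtMs < maxDelayMs) {
        continue;  // too young: a click may still arrive, so ignore it
      }
      // Clicks after the window are ignored to avoid biasing the model
      // in favor of old impressions.
      boolean clickedInWindow = imp.clickedAtMs != null
          && imp.clickedAtMs - imp.shownAtMs <= maxDelayMs;
      out.add(new Labeled(imp, clickedInWindow ? 1 : 0));
    }
    return out;
  }
}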
>> > >> Currently, what I can think of is to split the data randomly across
>> > >> multiple mappers, let every mapper learn from its local data, and
>> > >> then average the models, or let the multiple models vote on every
>> > >> feature's weight. A little like the idea of AdaBoost or RandomForest.
>> > >> But I am not a scientist or mathematician, so I have no idea if that
>> > >> is correct or not.
>> > >>
>> > >> Thanks so much.
>> > >> Stanley Xu
>> > >>
>> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning wrote:
>> > >>
>> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu wrote:
>> > >> >
>> > >> > > 1 hour is acceptable, but I guess you misunderstood the data
>> > >> > > scale I meant here. The 900M records didn't mean 900M bytes, but
>> > >> > > 900M lines of training set (900M training examples). If every
>> > >> > > training example has 1000 dimensions, that means 900 million x
>> > >> > > 1000 x 16 B = 14TB. If we reduce the logs collected to 14 days,
>> > >> > > it would still be 2-3TB of data.
>> > >> >
>> > >> > Oops. Forgot that last multiplier.
>> > >> >
>> > >> > > Per our simple test, for 1000 dimensions and 10M lines of
>> > >> > > records, it takes about 1-2 hours to do the training, so 900M
>> > >> > > lines of data will cost at least 90 hours, is that correct?
>> > >> >
>> > >> > 10M x 1000 x 8 = 80 GB.
>> > >> >
>> > >> > 1-2 hours = (approx) 5000 seconds. So this is
>> > >> >
>> > >> > 80 GB / 5000 s = 80/5 MB/s = 16 MB/s
>> > >> >
>> > >> > Yes. This is a reasonable speed. I think you can get a small
>> > >> > factor faster than this with SGD. I have seen 100 million records
>> > >> > with more non-zero values than you describe train in 3 hours. I
>> > >> > would not expect even as much as a factor of 10 speedup here.
>> > >> >
>> > >> > > And from the slides you provided,
>> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010, you said it
>> > >> > > would take less than an hour for 20M data records with mixed
>> > >> > > numeric/categorical dimensions. I am wondering, how many
>> > >> > > dimensions per record?
>> > >> >
>> > >> > These are sparse records with about a thousand non-zero elements
>> > >> > per record.
>> > >> >
>> > >> > But let's step back to your data for a moment. Where do these
>> > >> > thousand dimensions come from? Do you really have a thousand
>> > >> > hand-built features? Do you not have any sparse, text-like
>> > >> > features?
>> > >> >
>> > >> > If you really only have a thousand-dimensional problem, then I
>> > >> > think your model might exhibit early convergence.
>> > >> >
>> > >> > If not, it is quite possible to parallelize SGD, but this is only
>> > >> > likely to help with sparse inputs that exhibit a long-tail
>> > >> > frequency distribution.
>> >
>> > --
>> > Lance Norskog
>> > goksron@gmail.com
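A toy sketch of the shard-and-average idea Stanley floats above: run plain SGD for logistic regression independently on each shard, then average the weight vectors. Purely illustrative, and not how Mahout itself parallelizes anything.

public class AveragedSgd {
  // One plain SGD pass for logistic regression over a single shard.
  static double[] trainShard(double[][] x, int[] y, double rate) {
    double[] w = new double[x[0].length];
    for (int i = 0; i < x.length; i++) {
      double margin = 0.0;
      for (int j = 0; j < w.length; j++) margin += w[j] * x[i][j];
      double p = 1.0 / (1.0 + Math.exp(-margin));  // predicted P(y=1)
      for (int j = 0; j < w.length; j++) {
        w[j] += rate * (y[i] - p) * x[i][j];       // gradient step
      }
    }
    return w;
  }

  // Average the per-shard weight vectors into a single model; in a
  // MapReduce setting each trainShard call would run in its own mapper.
  static double[] average(double[][] shardWeights) {
    double[] avg = new double[shardWeights[0].length];
    for (double[] w : shardWeights) {
      for (int j = 0; j < avg.length; j++) {
        avg[j] += w[j] / shardWeights.length;
      }
    }
    return avg;
  }
}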