mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Which exact algorithm is used in the Mahout SGD?
Date Thu, 28 Apr 2011 19:06:39 GMT
On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <wenhao.xu@gmail.com> wrote:

> ... As I understood the algorithm, the training time depends only on the
> non-zero records, but per our test there is some overhead we could not
> ignore for those non-zero records, though the cost is sub-linear or
> logarithmic in the length of the hashed vector.
>

This is pretty close if we say "non-zero values".  A record usually refers
to an entire training example.

The extra work refers mostly to deferred regularization that eventually has
to be applied.  My guess is that it is even less than logarithmic in the
feature vector size.
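
To make that concrete, here is a rough sketch (plain Java, illustrative
only, not Mahout's actual code) of how lazy regularization typically works
for sparse inputs: each weight remembers the step at which it was last
regularized, and the accumulated penalty is applied only when that feature
next shows up with a non-zero value.

    // Sketch of lazy L2 regularization for sparse SGD.  Illustrative only;
    // the rate and lambda values are arbitrary, and this is not Mahout code.
    import java.util.Map;

    class LazySgd {
      private final double[] w;            // weights
      private final int[] lastRegularized; // step at which each weight was last shrunk
      private final double rate = 0.01;    // learning rate (illustrative)
      private final double lambda = 1e-4;  // L2 penalty (illustrative)
      private int step = 0;

      LazySgd(int numFeatures) {
        w = new double[numFeatures];
        lastRegularized = new int[numFeatures];
      }

      // features holds index -> value for the non-zero entries only
      void train(int label, Map<Integer, Double> features) {
        step++;
        // catch up the deferred regularization, but only for features
        // that are actually present in this example
        for (int j : features.keySet()) {
          int missed = step - lastRegularized[j];
          w[j] *= Math.pow(1.0 - rate * lambda, missed);
          lastRegularized[j] = step;
        }
        double error = label - predict(features);
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
          w[e.getKey()] += rate * error * e.getValue();
        }
      }

      double predict(Map<Integer, Double> features) {
        double s = 0;
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
          s += w[e.getKey()] * e.getValue();
        }
        return 1.0 / (1.0 + Math.exp(-s));
      }
    }

The work per example is proportional to the number of non-zero values, plus
a catch-up loop over those same values, which is why a long, mostly-zero
hashed vector adds very little extra cost.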


> And in CTR prediction, I am not quite sure it will converge very quickly.
>

I was saying this purely based on the number of features.


> Because we will very possibly see some records that have almost the same
> features but different results in display ads.


The algorithm can still converge to an estimate of the probability here.


> But we will see the result in the
> future.


You have to be *very* careful about this to avoid prejudicing the model
against recent impressions.  If you have fast feedback to the ad targeting
system, you can have severe instability.

The key thing that you have to do to avoid these biases is to define a
maximum delay before click for the purposes of modeling.  You need to ignore
all impressions younger than this delay (because they may still get a click)
and you need to ignore all clicks after this delay (to avoid bias in favor of
old impressions).  For on-line ads you can probably use a maximum delay of a
few minutes because most clicks will happen by then.
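
As a concrete illustration of that rule, the labeling step might look
roughly like the sketch below (the class and names are made up for
illustration; this is not a Mahout API):

    // Sketch of the maximum-delay labeling rule described above.
    class DelayLabeler {
      static final long MAX_DELAY_MS = 5 * 60 * 1000L;  // "a few minutes"

      // Returns 1 (click), 0 (no click), or -1 (impression still too young
      // to label; skip it for now).
      static int label(long impressionTimeMs, Long clickTimeMs, long nowMs) {
        if (nowMs - impressionTimeMs < MAX_DELAY_MS) {
          return -1;  // may still get a click -- ignore for training
        }
        if (clickTimeMs != null && clickTimeMs - impressionTimeMs <= MAX_DELAY_MS) {
          return 1;   // click arrived within the window
        }
        return 0;     // no click within the window counts as a negative
      }
    }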

To find a good value for maximum delay, you should plot the CTR for a bunch
of ads versus delay.  This will increase rapidly shortly after zero delay,
but then will level off.  The ordering of ads by CTR is what you care about,
so you can follow the curves back and find the shortest delay where the
ordering is clearly preserved.  Use that as your maximum delay.  Typically
this is roughly where your CTR is at about 80-90% of the final value.
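
The measurement behind that plot is cheap to compute.  Something like the
following sketch (the data layout is assumed; this is not anything in
Mahout) gives the cumulative CTR for one ad as a function of the delay
cutoff.  Plot it for a bunch of ads and look for the shortest cutoff where
the ordering of the ads is stable:

    // Sketch: cumulative CTR for one ad as a function of the delay cutoff.
    // clickDelaysMs holds the impression-to-click delay for each clicked
    // impression; cutoffsMs must be sorted in ascending order.
    import java.util.Arrays;

    class CtrByDelay {
      static double[] ctrAtCutoffs(long[] clickDelaysMs, long totalImpressions,
                                   long[] cutoffsMs) {
        long[] sorted = clickDelaysMs.clone();
        Arrays.sort(sorted);
        double[] ctr = new double[cutoffsMs.length];
        int clicks = 0;
        for (int c = 0; c < cutoffsMs.length; c++) {
          while (clicks < sorted.length && sorted[clicks] <= cutoffsMs[c]) {
            clicks++;
          }
          ctr[c] = (double) clicks / totalImpressions;  // clicks so far / impressions
        }
        return ctr;
      }
    }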




> (We are still working on creating a framework to dig out all the
> features we need from the logs.  I would like to share our experience with
> using Mahout SGD once we get our CTR prediction model released.)
>
> And for parallelizing SGD, what do you mean by "help with sparse inputs
> that exhibit a long-tail frequency distribution"?  Would you like to share
> some of your ideas, Ted?
>
> Currently, what I can think of is to split the data randomly across
> multiple mappers, let every mapper learn from its local data, and average
> over the whole model, or let the multiple models vote on every feature's
> weight.  A little like the idea of AdaBoost or random forests.  But I am
> not a scientist or mathematician, so I have no idea whether it is correct
> or not.
>
>
> Thanks so much.
> Stanley Xu
>
>
>
> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <wenhao.xu@gmail.com>
> wrote:
> >
> > > 1 hour is acceptable, but I guess you misunderstood the data scale I
> > > meant here.  The 900M records didn't mean 900M bytes, but 900M lines of
> > > training set (900M training examples).  If every training example has
> > > 1000 dimensions, it means 900 million x 1000 x 16 B = 14TB.  If we
> > > reduce the logs collected to 14 days, it would still be 2-3TB of data.
> > >
> >
> > Oops.  Forgot that last multiplier.
> >
> >
> > > Per our simple test, for 1000 dimensions and 10M lines of records, it
> > > will take about 1-2 hours to do the training, so 900M lines of data
> > > will cost at least 90 hours, is that correct?
> > >
> >
> > 10M x 1000 x 8 = 80 GB.
> >
> > 1-2 hours = (approx) 5000 seconds.  So this is
> >
> > 80 GB / 5000 s = 80/5 MB/s = 16 MB/s
> >
> > Yes.  This is reasonable speed.  I think you can get a small factor
> > faster than this with SGD.  I have seen 100 million records with more
> > non-zero values than you describe with a training time of 3 hours.  I
> > would not expect even as much as a factor of 10 speedup here.
> >
> >
> > >
> > > And from the PPT you provided
> > > http://www.slideshare.net/tdunning/sdforum-11042010
> > > you said it would take less than an hour for 20M data records with
> > > mixed numeric/categorical dimensions.  I am wondering, how many
> > > dimensions per record?
> > >
> >
> > These are sparse records with about a thousand non-zero elements per
> > record.
> >
> >
> > But let's step back to your data for a moment.  Where do these thousand
> > dimensions come from?  Do you really have a thousand hand-built features?
> >  Do you not have any sparse, text-like features?
> >
> > If you really only have a thousand dimensional problem, then I think your
> > model might exhibit early convergence.
> >
> > If not, it is quite possible to parallelize SGD, but this is only likely
> > to help with sparse inputs that exhibit a long-tail frequency
> > distribution.
> >
>
