mahout-user mailing list archives

From Stanley Xu <wenhao...@gmail.com>
Subject Re: Which exact algorithm is used in the Mahout SGD?
Date Tue, 26 Apr 2011 06:46:36 GMT
Hi Ted,

1 hour is acceptable, but I guess you misunderstood the data scale I meant
here. The 900M records does not mean 900 MB, but 900M lines of training
data (900M training examples). If every training example has 1000
dimensions, that means 900 million x 1000 x 16 B = 14.4 TB. If we reduce the
logs collected to 14 days, it would still be 2-3 TB of data.

Per our simple test, with 1000 dimensions and 10M records, training takes
about 1-2 hours, so 900M records would cost at least 90 hours. Is that
correct?
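
The estimate above can be checked with a short sketch. The 16 bytes per
value, the 1000 dimensions, and the 10M-records-in-1-2-hours measurement
come from this message; the function names are made up for illustration, and
the time figure assumes training time grows linearly with the number of
records, which may not hold in practice.

```python
# Back-of-envelope estimate of data volume and single-machine training time.
# All constants come from the message; linear scaling is an assumption.

BYTES_PER_VALUE = 16   # storage per feature value, as stated in the message
DIMENSIONS = 1000      # feature dimensions per training example


def dataset_bytes(num_records, dims=DIMENSIONS, bytes_per_value=BYTES_PER_VALUE):
    """Raw size in bytes of a dense data set with the given shape."""
    return num_records * dims * bytes_per_value


def extrapolated_hours(num_records, measured_records, measured_hours):
    """Scale a measured training time linearly to a larger record count."""
    return num_records / measured_records * measured_hours


if __name__ == "__main__":
    full = 900_000_000                 # 90 days of down-sampled logs
    two_weeks = full * 14 // 90        # same rate, 14 days of logs

    print("90 days: %.2f TB" % (dataset_bytes(full) / 1e12))
    print("14 days: %.2f TB" % (dataset_bytes(two_weeks) / 1e12))

    # The simple test took 1-2 hours for 10M records.
    low = extrapolated_hours(full, 10_000_000, 1)
    high = extrapolated_hours(full, 10_000_000, 2)
    print("900M records: %.0f-%.0f hours" % (low, high))
```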

And in the slides you provided
http://www.slideshare.net/tdunning/sdforum-11042010
you said it would take less than an hour for 20M data records with mixed
numeric/categorical dimensions. I am wondering, how many dimensions per
record?

Thanks.
Stanley Xu



On Tue, Apr 26, 2011 at 2:05 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> How much time do you have available for training?
>
> If you can do feature encoding in parallel, then you can probably do this
> pretty fast with SGD.
>
> My guess is that you can push 2-20 MB/s of data through SGD with your kind
> of  data with a good 8 core processor.  If you pre-process your data into 8
> B / dimension, this is 0.25 - 2.5 million data points per second.  This
> could mean that your training takes less than an hour.  If your training
> converges with less data, you may do even better.
>
> Is that not acceptable?
>
> On Mon, Apr 25, 2011 at 10:11 PM, Stanley Xu <wenhao.xu@gmail.com> wrote:
>
> > Thanks Ted. Read the paper and the code and got the rough idea of how the
> > iteration goes. Thanks so much.
> >
> > With the current data scale we have, we were considering whether we
> > could train on more data with Logistic Regression. For example, suppose
> > we wanted to train a model for CTR prediction on the last 90 days of
> > data. It would be 900M records after down sampling, and assume there are
> > 1000 feature dimensions. It would still be very slow on a single machine
> > with the current SGD algorithm.
> >
> > I am wondering if there is a parallel algorithm with map-reduce I could
> > use for Logistic Regression? The original Newton-Raphson will take
> > N*N*M/P according to the "Map-Reduce for Machine Learning on Multicore"
> > paper, which is much slower than SGD on a single machine in a
> > high-dimensional space.
> >
> > Could an algorithm like IRLS be parallelized, or is there any
> > approximate algorithm that could be parallelized?
> >
> > Thanks,
> > Stanley Xu
> >
> >
> >
> > On Mon, Apr 25, 2011 at 11:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> > > Paul K described in-memory algorithms in his dissertation.  Mahout uses
> > > on-line algorithms which are not limited by memory size.
> > >
> > > The method used in Mahout is closer to what Bob Carpenter describes here:
> > > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
> > >
> > > The most important additions in Mahout are:
> > >
> > > a) confidence weighted learning rates per term
> > >
> > > b) evolutionary tuning of hyper-parameters
> > >
> > > c) mixed ranking and regression
> > >
> > > d) grouped AUC
> > >
> > > On Mon, Apr 25, 2011 at 6:12 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:
> > >
> > > > Dear All,
> > > >
> > > > I am trying to go through the Mahout SGD algorithm and have been
> > > > reading "Logistic Regression for Data Mining and High-Dimensional
> > > > Classification" a little. I am wondering which algorithm exactly is
> > > > used in the SGD code? There are quite a few algorithms mentioned in
> > > > the paper, and it is a little hard for me to find the one that
> > > > matches the code.
> > > >
> > > > Thanks in advance.
> > > >
> > > > Best wishes,
> > > > Stanley Xu
> > > >
> > >
> >
>
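
For readers following the thread, the on-line idea discussed above (update
the model one example at a time, so memory use does not depend on the number
of records) can be sketched roughly as below. This is plain SGD for logistic
regression with a fixed learning rate. It is not Mahout's implementation and
does not include the per-term learning rates, hyper-parameter tuning, or
other additions listed in the thread. The function names, the sparse
dict-of-features input format, and the default parameter values are
assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(stream, num_features, learning_rate=0.1, l2=0.0):
    """Train logistic regression in one pass over `stream`.

    `stream` yields (features, label) pairs, where `features` is a dict
    mapping feature index -> value and `label` is 0 or 1. Only the weight
    vector is kept in memory, never the data set itself.
    """
    w = [0.0] * num_features
    for features, label in stream:
        z = sum(w[i] * x for i, x in features.items())
        err = label - sigmoid(z)  # gradient of the log-likelihood w.r.t. z
        for i, x in features.items():
            # Gradient step with optional L2 shrinkage on touched weights.
            w[i] += learning_rate * (err * x - l2 * w[i])
    return w


def predict(w, features):
    """Predicted probability that the label is 1."""
    return sigmoid(sum(w[i] * x for i, x in features.items()))


if __name__ == "__main__":
    # Toy data: feature 0 is a bias term, feature 1 carries the signal.
    data = [({0: 1.0, 1: 1.0}, 1), ({0: 1.0, 1: -1.0}, 0)] * 200
    weights = sgd_logistic(iter(data), num_features=2)
    print(predict(weights, {0: 1.0, 1: 1.0}))
    print(predict(weights, {0: 1.0, 1: -1.0}))
```

Because the loop touches only the features present in each example, the cost
per record depends on the number of non-zero features rather than on the
total dimension count.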
