Hi Ted,
1 hour is acceptable, but I think you misunderstood the data scale I meant
here. The 900M records didn't mean 900M bytes, but 900M lines of the training
set (900M training examples). If every training example has 1000 dimensions,
that means 900 million x 1000 x 16 B = 14.4 TB. Even if we reduce the logs
collected to 14 days, it would still be about 2.3 TB of data.
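To double-check the arithmetic above (assuming 16 bytes per dimension, as in the figures quoted, and a record count proportional to the number of days of logs):

```python
# Back-of-envelope check of the training-set sizes quoted above.
records_90d = 900_000_000   # 900M training examples for 90 days of logs
dims = 1000                 # feature dimensions per example
bytes_per_dim = 16          # assumed 16 B stored per dimension

size_90d = records_90d * dims * bytes_per_dim
print(size_90d / 1e12)      # ~14.4 TB for 90 days of logs

# Keeping only 14 days of logs (proportional record count):
size_14d = size_90d * 14 / 90
print(size_14d / 1e12)      # ~2.24 TB
```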
In our simple test with 1000 dimensions and 10M lines of records, the
training took about 12 hours, so 90M lines of data would cost at least 90
hours. Is that correct?
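A rough linear extrapolation from our measured run (assuming SGD training time scales roughly linearly with the number of examples, which our 12-hour figure for 10M lines suggests):

```python
# Extrapolate single-machine SGD training time linearly from our test run.
measured_lines = 10_000_000   # 10M lines in our test
measured_hours = 12.0         # ~12 hours observed

hours_per_line = measured_hours / measured_lines
print(hours_per_line * 90_000_000)    # 90M lines  -> ~108 hours
print(hours_per_line * 900_000_000)   # full 900M  -> ~1080 hours
```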
And in the PPT you provided,
http://www.slideshare.net/tdunning/sdforum11042010
you said it would take less than an hour for 20M data records with mixed
numeric/categorical dimensions. I am wondering: how many dimensions per
record?
Thanks.
Stanley Xu
On Tue, Apr 26, 2011 at 2:05 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> How much time do you have available for training?
>
> If you can do feature encoding in parallel, then you can probably do this
> pretty fast with SGD.
>
> My guess is that you can push 220 MB/s of data through SGD with your kind
> of data with a good 8 core processor. If you preprocess your data into 8
> B / dimension, this is 0.25-2.5 million data points per second. This
> could mean that your training takes less than an hour. If your training
> converges with less data, you may do even better.
>
> Is that not acceptable?
>
> On Mon, Apr 25, 2011 at 10:11 PM, Stanley Xu <wenhao.xu@gmail.com> wrote:
>
> > Thanks Ted. Read the paper and the code and got a rough idea of how the
> > iteration goes. Thanks so much.
> >
> > With the current data scale we have, we were considering whether we could
> > train more data with Logistic Regression. For example, if we wanted to
> > train a model for CTR prediction on the last 90 days of data, it would be
> > 900M records after down-sampling, and assume there are 1000 feature
> > dimensions. It would still be very slow on a single machine with the
> > current SGD algorithm.
> >
> > I am wondering if there is a parallel algorithm with MapReduce I could
> > use for Logistic Regression? The original Newton-Raphson takes N*N*M/P
> > by the "MapReduce for Machine Learning on Multicore" paper, which is
> > much slower than SGD on a single machine in a high-dimensional space.
> >
> > Could an algorithm like IRLS be parallelized, or is there any
> > approximate algorithm that could be parallelized?
> >
> > Thanks,
> > Stanley Xu
> >
> >
> >
> > On Mon, Apr 25, 2011 at 11:58 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > Paul K described in-memory algorithms in his dissertation. Mahout uses
> > > online algorithms which are not limited by memory size.
> > >
> > > The method used in Mahout is closer to what Bob Carpenter describes
> here:
> > > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
> > >
> > > The most important additions in Mahout are:
> > >
> > > a) confidence weighted learning rates per term
> > >
> > > b) evolutionary tuning of hyperparameters
> > >
> > > c) mixed ranking and regression
> > >
> > > d) grouped AUC
> > >
> > > On Mon, Apr 25, 2011 at 6:12 AM, Stanley Xu <wenhao.xu@gmail.com>
> wrote:
> > >
> > > > Dear All,
> > > >
> > > > I am trying to go through the Mahout SGD algorithm and to read
> > > > "Logistic Regression for Data Mining and High-Dimensional
> > > > Classification" a little bit. I am wondering which algorithm is
> > > > exactly used in the SGD code? There are quite a few algorithms
> > > > mentioned in the paper, and it is a little hard for me to find the
> > > > one that matches the code.
> > > >
> > > > Thanks in advance.
> > > >
> > > > Best wishes,
> > > > Stanley Xu
> > > >
> > >
> >
>
