Thanks, Ted. I read the paper and the code and got a rough idea of how the
iteration works. Thanks so much.
At our current data scale, we are considering whether we could train on more
data with logistic regression. For example, suppose we wanted to train a
CTR-prediction model on the last 90 days of data. That would be about 900M
records after down-sampling, with roughly 1,000 feature dimensions. Training
that on a single machine with the current SGD algorithm would still be very
slow.
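To make "very slow" concrete, here is a back-of-envelope sketch; the per-core
flop rate is my own assumed ballpark, not a measurement:

```python
# Rough per-epoch cost at the scale above: 900M records x 1000 features.
# Assumes ~1e9 useful flops/s on a single core (an assumption, not a benchmark).
records = 900_000_000
features = 1_000
flops_per_epoch = 2 * records * features   # one multiply-add per feature per record
seconds_per_epoch = flops_per_epoch / 1e9  # 1800 s, i.e. about half an hour per pass
```

And that is before I/O and parsing, and SGD usually needs multiple passes.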
I am wondering whether there is a parallel MapReduce algorithm I could use for
logistic regression. According to the "Map-Reduce for Machine Learning on
Multicore" paper, plain Newton-Raphson takes O(N*N*M/P) time, which is much
slower than SGD on a single machine in a high-dimensional space.
Could an algorithm like IRLS be parallelized, or is there an approximate
algorithm that could be?
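For what it's worth, one thing that does parallelize naturally (not IRLS
itself, just a sketch of the simpler option) is batch gradient descent: the
logistic-loss gradient is a sum over records, so mappers can emit partial
gradients per data shard and a reducer sums them and takes one step, giving
roughly O(N*M/P) per iteration. A toy NumPy sketch of that map/reduce pattern,
with made-up shard count and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_partial(w, X_shard, y_shard):
    # "Mapper": partial gradient of the logistic loss over one shard,
    # plus the shard size so the reducer can average.
    p = sigmoid(X_shard @ w)
    return X_shard.T @ (p - y_shard), len(y_shard)

def reduce_step(w, partials, lr):
    # "Reducer": sum the shard gradients, average, and take one batch step.
    grad = sum(g for g, _ in partials) / sum(n for _, n in partials)
    return w - lr * grad

# Toy run: 4 "mappers" over a small synthetic, linearly separable dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (X @ true_w > 0).astype(float)

w = np.zeros(5)
for _ in range(300):
    shards = zip(np.array_split(X, 4), np.array_split(y, 4))
    partials = [map_partial(w, Xs, ys) for Xs, ys in shards]
    w = reduce_step(w, partials, lr=0.5)

accuracy = float(np.mean((X @ w > 0) == (y == 1.0)))
```

Of course this needs one MapReduce job per iteration, so the per-job overhead
may dominate unless each iteration does enough work.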
Thanks,
Stanley Xu
On Mon, Apr 25, 2011 at 11:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Paul K described in-memory algorithms in his dissertation. Mahout uses
> online algorithms which are not limited by memory size.
>
> The method used in Mahout is closer to what Bob Carpenter describes here:
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
>
> The most important additions in Mahout are:
>
> a) confidence weighted learning rates per term
>
> b) evolutionary tuning of hyperparameters
>
> c) mixed ranking and regression
>
> d) grouped AUC
>
> On Mon, Apr 25, 2011 at 6:12 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:
>
> > Dear All,
> >
> > I am trying to go through the Mahout SGD algorithm and to read the
> > "Logistic Regression for Data Mining and High-Dimensional
> > Classification" paper a little bit. I am wondering which algorithm is
> > exactly used in the SGD code? There are quite a few algorithms
> > mentioned in the paper, and it is a little hard for me to figure out
> > which one matches the code.
> >
> > Thanks in advance.
> >
> > Best wishes,
> > Stanley Xu
> >
>
