mahout-user mailing list archives

From Stanley Xu <wenhao...@gmail.com>
Subject Re: Which exact algorithm is used in the Mahout SGD?
Date Tue, 26 Apr 2011 05:11:25 GMT
Thanks, Ted. I read the paper and the code and now have a rough idea of how the
iteration goes. Thanks so much.
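If I read it right, the core of each update is a lazily regularized SGD step over
only the non-zero features of a record, roughly like the sketch below (my own
paraphrase of the idea in the Carpenter note Ted links below, not the actual Mahout
code, which also adds the per-term learning rates and evolutionary tuning listed
below; the constants are illustrative):

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of lazily regularized SGD for logistic regression on sparse data.
 * The L2 decay a weight has "missed" is caught up only when its feature is
 * next seen, so each update touches only the non-zero features of a record.
 */
public class LazySgdSketch {

  private final Map<Integer, Double> weights = new HashMap<>();
  private final Map<Integer, Long> lastStep = new HashMap<>(); // step when a weight was last touched
  private final double lambda = 1e-6; // L2 decay per step (illustrative value)
  private final double eta = 0.1;     // learning rate (illustrative value)
  private long step = 0;

  /** One SGD step on a sparse record (featureIndex -> value) with label in {0, 1}. */
  public void train(Map<Integer, Double> record, int label) {
    step++;
    double margin = 0.0;
    for (Map.Entry<Integer, Double> e : record.entrySet()) {
      int j = e.getKey();
      double w = weights.getOrDefault(j, 0.0);
      long missed = step - lastStep.getOrDefault(j, step);
      w *= Math.pow(1.0 - eta * lambda, missed); // catch up the decay lazily
      weights.put(j, w);
      lastStep.put(j, step);
      margin += w * e.getValue();
    }
    double p = 1.0 / (1.0 + Math.exp(-margin)); // predicted click probability
    double err = label - p;
    for (Map.Entry<Integer, Double> e : record.entrySet()) {
      int j = e.getKey();
      weights.put(j, weights.get(j) + eta * err * e.getValue()); // gradient step on non-zeros only
    }
  }
}

The point of the lazy catch-up is that a record with k non-zero features costs
O(k) work instead of O(number of dimensions).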

Given the data scale we have now, we have been considering whether we could train
logistic regression on more data. For example, if we wanted to train a CTR
prediction model on the last 90 days of data, that would be about 900M records
after down-sampling, with roughly 1000 feature dimensions. Training that on a
single machine with the current SGD algorithm would still be very slow.
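(Rough arithmetic on that scale, assuming dense vectors; with sparse CTR features
the cost scales with the number of non-zeros per record instead:

  9 x 10^8 records x 10^3 features/record ~= 9 x 10^11 multiply-adds

per sequential pass over the data, before any parsing or I/O cost.)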

I am wondering whether there is a parallel map-reduce algorithm I could use for
logistic regression. According to the "Map-Reduce for Machine Learning on
Multicore" paper, the original Newton-Raphson approach takes on the order of
N*N*M/P (with N feature dimensions, M records, and P machines), which is much
slower than SGD on a single machine in a high-dimensional space.

Could an algorithm like IRLS be parallelized, or is there an approximate
algorithm that could be?
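To make the question concrete: as far as I can tell, each IRLS/Newton step only
needs sums over the records (gradient and Hessian statistics), and sums can be
computed per partition and then added, which is what makes it look map-reducible.
A rough sketch of the per-partition gradient accumulation for logistic regression
(class and method names are only illustrative, not an existing Mahout or Hadoop
API):

import java.util.List;

/**
 * Sketch: per-partition gradient of the logistic log-likelihood.
 * Each map task would run this over its split of the records; a single
 * reduce adds the partial gradients element-wise. Names are illustrative.
 */
public class PartialLogisticGradient {

  /** One training record: a dense feature vector and a 0/1 click label. */
  public static class TrainingRecord {
    final double[] x;
    final int y;
    TrainingRecord(double[] x, int y) { this.x = x; this.y = y; }
  }

  /** Gradient contribution of one partition: sum of (y - sigmoid(beta.x)) * x. */
  public static double[] partialGradient(List<TrainingRecord> partition, double[] beta) {
    double[] g = new double[beta.length];
    for (TrainingRecord r : partition) {
      double margin = 0.0;
      for (int j = 0; j < beta.length; j++) {
        margin += beta[j] * r.x[j];
      }
      double p = 1.0 / (1.0 + Math.exp(-margin)); // predicted CTR for this record
      double err = r.y - p;
      for (int j = 0; j < beta.length; j++) {
        g[j] += err * r.x[j];
      }
    }
    return g; // partials from different partitions simply add
  }

  /** Reduce step: element-wise sum of two partial gradients. */
  public static double[] combine(double[] a, double[] b) {
    for (int j = 0; j < a.length; j++) {
      a[j] += b[j];
    }
    return a;
  }
}

Each IRLS step would additionally need the weighted sum of the x*x^T terms
(weights p*(1-p)) for the Hessian, which decomposes over partitions the same way;
the final O(N^3) solve would then run on a single node, which should be cheap for
N around 1000.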

Thanks,
Stanley Xu



On Mon, Apr 25, 2011 at 11:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Paul Komarek described in-memory algorithms in his dissertation.  Mahout uses
> on-line algorithms which are not limited by memory size.
>
> The method used in Mahout is closer to what Bob Carpenter describes here:
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
>
> The most important additions in Mahout are:
>
> a) confidence weighted learning rates per term
>
> b) evolutionary tuning of hyper-parameters
>
> c) mixed ranking and regression
>
> d) grouped AUC
>
> On Mon, Apr 25, 2011 at 6:12 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:
>
> > Dear All,
> >
> > I am trying to work through the Mahout SGD algorithm and have been reading
> > "Logistic Regression for Data Mining and High-Dimensional Classification".
> > I am wondering which algorithm exactly is used in the SGD code? Quite a few
> > algorithms are mentioned in the paper, and it is a little hard for me to
> > figure out which one matches the code.
> >
> > Thanks in advance.
> >
> > Best wishes,
> > Stanley Xu
> >
>
