mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kun Yang <kuny...@stanford.edu>
Subject Re: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce
Date Sun, 21 Jul 2013 20:41:45 GMT
Hi,

Thanks for the feedback.

The algorithm is not solving the normal equation as in the ordinary linear regression. I did
not detail the algorithm to solve the penalized optimization in the paper. To solve the penalized
version, I will use the coordinate descent which is well documented in other paper (see Freedman's
paper, for 1000 variables, it takes ~1min to do cross validation in the R package) and is
very efficient.

As I discussed in the conclusion section, to solve the problem with large number of predictors,
it is still a challenge even though in the single machine or MPI version, but the proposed
algorithm can handle the number of variable at the order of 5000 and it covers lots of applications.

My plan is to implement a working version first then add some refinements such as sparse vectors.

Any feedback?
Best
-Kun

----- Original Message -----
From: "Ted Dunning (JIRA)" <jira@apache.org>
To: dev@mahout.apache.org
Sent: Sunday, July 21, 2013 1:10:49 PM
Subject: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear
Regression on MapReduce


    [ https://issues.apache.org/jira/browse/MAHOUT-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714796#comment-13714796
] 

Ted Dunning edited comment on MAHOUT-1273 at 7/21/13 8:09 PM:
--------------------------------------------------------------

The algorithm document describes a standard method for solving the normal equations in a single
pass.  This works fine with a small number of variables, but is completely infeasible if we
have a very large number of variables as when the input is very sparse.  

At least as important is the fact that the document ignores the problem of solving the regularized
version of the problem.  A penalty function, p_lambda( x ), is mentioned, but the algorithm
quoted ignores this penalty and solves the unpenalized form.


                
      was (Author: tdunning):
    The algorithm document describes a standard method for solving the normal equations in
a single pass.  This works fine with a small number of variables, but is completely infeasible
if we have a very large number of variables as when the input is very sparse.  

At least as important is the fact that the document ignores the problem of solving the regularized
version of the problem.  A penalty function, p_lambda(x), is mentioned, but the algorithm
quoted ignores this penalty and solves the unpenalized form.


                  
> Single Pass Algorithm for Penalized Linear Regression on MapReduce
> ------------------------------------------------------------------
>
>                 Key: MAHOUT-1273
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1273
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Kun Yang
>         Attachments: PenalizedLinear.pdf
>
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Penalized linear regression such as Lasso, Elastic-net are widely used in machine learning,
but there are no very efficient scalable implementations on MapReduce.
> The published distributed algorithms for solving this problem is either iterative (which
is not good for MapReduce, see Steven Boyd's paper) or approximate (what if we need exact
solutions, see Paralleled stochastic gradient descent); another disadvantage of these algorithms
is that they can not do cross validation in the training phase, which requires a user-specified
penalty parameter in advance. 
> My ideas can train the model with cross validation in a single pass. They are based on
some simple observations.
> I have implemented the primitive version of this algorithm in Alpine Data Labs. Advanced
features such as inner-mapper combiner are employed to reduce the network traffic in the shuffle
phase.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message