mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lee (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression with Cross Validation on MapReduce
Date Sun, 04 Aug 2013 01:10:46 GMT


Jason Lee commented on MAHOUT-1273:

Hi Ted,
I have done some work on Penalized Linear Regression. I read through the draft and went over
the algorithm described. As described, I think the algorithm is correct. It uses Mapreduce
to compute X'X, X'y, and y'y, then minimizes over \beta: 1/2 \beta'(X'X) \beta -\beta' X'y
+1/2 y'y +\lambda P(\beta), where P(\beta) is the elastic net penalty function. It requires
one pass of Mapreduce to compute the quantities X'X, X'y, and y'y, and then the algorithm
uses the Pathwise Coordinate Descent Algorithm locally to solve the resulting problem. This
algorithm is extremely suitable when the number of features (p) is around O(10,000). It only
uses one pass of Mapreduce, so the primary computation is performed locally. 
> Single Pass Algorithm for Penalized Linear Regression with Cross Validation on MapReduce
> ----------------------------------------------------------------------------------------
>                 Key: MAHOUT-1273
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.9
>            Reporter: Kun Yang
>              Labels: documentation, features, patch, test
>             Fix For: 0.9
>         Attachments: Algorithm and Numeric Stability.pdf, java files.pdf, Manual and
Example.pdf, Notes.pdf, PenalizedLinear.pdf, PenalizedLinearRegression.patch
>   Original Estimate: 720h
>  Remaining Estimate: 720h
> Penalized linear regression such as Lasso, Elastic-net are widely used in machine learning,
but there are no very efficient scalable implementations on MapReduce.
> The published distributed algorithms for solving this problem is either iterative (which
is not good for MapReduce, see Steven Boyd's paper) or approximate (what if we need exact
solutions, see Paralleled stochastic gradient descent); another disadvantage of these algorithms
is that they can not do cross validation in the training phase, which requires a user-specified
penalty parameter in advance. 
> My ideas can train the model with cross validation in a single pass. They are based on
some simple observations.
> The core algorithm is a modified version of coordinate descent (see J. Freedman's paper).
They implemented a very efficient R package "glmnet", which is the de facto standard of penalized
> I have implemented the primitive version of this algorithm in Alpine Data Labs.  

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message