spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joseph Bradley <jos...@databricks.com>
Subject Re: Have Friedman's glmnet algo running in Spark
Date Tue, 24 Feb 2015 23:27:03 GMT
Hi Mike,

I'm not aware of a "standard" big dataset, but there are a number available:
* The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
instances but not # features):
www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
* I've used this text dataset from which one can generate lots of n-gram
features (but not many instances): http://www.ark.cs.cmu.edu/10K/
* I've seen some papers use the KDD Cup datasets, which might be the best
option I know of.  The KDD Cup 2012 track 2 one seems promising.

Good luck!
Joseph

On Tue, Feb 24, 2015 at 1:56 PM, <mike@mbowles.com> wrote:

> Joseph,
> Thanks for your reply.  We'll take the steps you suggest - generate some
> timing comparisons and post them in the GLMNET JIRA with a link from the
> OWLQN JIRA.
>
> We've got the regression version of GLMNET programmed.  The regression
> version only requires a pass through the data each time the active set of
> coefficients changes.  That's usualy less than or equal to the number of
> decrements in the penalty coefficient (typical default = 100).  The
> intermediate iterations can be done using results of previous passes
> through the full data set.  We're expecting the number of data passes will
> be independent of either number of rows or columns in the data set.  We're
> eager to demonstrate this scaling.  Do you have any suggestions regarding
> data sets for large scale regression problems?  It would be nice to
> demonstrate scaling for both number of rows and number of columns.
>
> Thanks for your help.
> Mike
>
> -----Original Message-----
> *From:* Joseph Bradley [mailto:joseph@databricks.com]
> *Sent:* Sunday, February 22, 2015 06:48 PM
> *To:* mike@mbowles.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Have Friedman's glmnet algo running in Spark
>
> Hi Mike, glmnet has definitely been very successful, and it would be great
> to see how we can improve optimization in MLlib! There is some related work
> ongoing; here are the JIRAs: GLMNET implementation in Spark
> LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
> The GLMNET JIRA has actually been closed in favor of the latter JIRA.
> However, if you're getting good results in your experiments, could you
> please post them on the GLMNET JIRA and link them from the other JIRA? If
> it's faster and more scalable, that would be great to find out. As far as
> where the code should go and the APIs, that can be discussed on the JIRA. I
> hope this helps, and I'll keep an eye out for updates on the JIRAs! Joseph
> On Thu, Feb 19, 2015 at 10:59 AM,  wrote: > Dev List, > A couple of
> colleagues and I have gotten several versions of glmnet algo > coded and
> running on Spark RDD. glmnet algo ( >
> http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for >
> generating coefficient paths solving penalized regression with elastic net
> > penalties. The algorithm runs fast by taking an approach that generates >
> solutions for a wide variety of penalty parameter. We're able to integrate
> > into Mllib class structure a couple of different ways. The algorithm may
> > fit better into the new pipeline structure since it naturally returns a >
> multitide of models (corresponding to different vales of penalty >
> parameters). That appears to fit better into pipeline than Mllib linear >
> regression (for example). > > We've got regression running with the speed
> optimizations that Friedman > recommends. We'll start working on the
> logistic regression version next. > > We're eager to make the code
> available as open source and would like to > get some feedback about how
> best to do that. Any thoughts? > Mike Bowles. > > >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message