spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6683) Handling feature scaling properly for GLMs
Date Fri, 03 Apr 2015 21:39:55 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395122#comment-14395122 ]

Joseph K. Bradley commented on SPARK-6683:
------------------------------------------

Great, it sounds like we're in agreement about the API and algorithm behavior.  W.r.t. the implementation, I haven't thought it through too carefully.  I would expect squared error to be the easiest loss to handle, since (I believe) it would reduce to scaling stepSize separately for each feature (applied to the loss gradient, not the regularization gradient); a rough sketch of that idea is below.  I'm not sure about the other losses...
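
For concreteness, here is a minimal sketch of that reduction, assuming squared-error loss with L2 regularization.  This is not Spark's GradientDescent; the object and method names are made up for illustration:

// Sketch only (hypothetical helper, not Spark's GradientDescent): one SGD step
// on squared-error loss, with standardization folded into per-feature step
// sizes instead of a rescaled copy of the data.  Running plain SGD on scaled
// features x'_j = x_j / sigma_j with weights w' is equivalent, through
// w_j = w'_j / sigma_j, to updating the original weights with step size
// alpha / sigma_j^2 on the loss gradient:
//   w_j <- w_j - (alpha / sigma_j^2) * (w . x - y) * x_j
// The L2 term keeps the plain step size and the original weights, so
// regularization (and hence the optimal solution) is unchanged.
object PerFeatureStepSketch {
  def step(
      weights: Array[Double], // w, in the original (unscaled) feature space
      x: Array[Double],       // one unscaled example
      y: Double,              // its label
      sigma: Array[Double],   // per-feature standard deviations (assumed > 0)
      alpha: Double,          // base step size
      lambda: Double          // L2 regularization strength
  ): Unit = {
    // residual (w . x - y), computed in the original feature space
    var dot = 0.0
    var j = 0
    while (j < x.length) { dot += weights(j) * x(j); j += 1 }
    val residual = dot - y
    j = 0
    while (j < x.length) {
      val lossGrad = residual * x(j)     // squared-error loss gradient
      val regGrad  = lambda * weights(j) // L2 gradient, deliberately NOT rescaled
      weights(j) -= alpha / (sigma(j) * sigma(j)) * lossGrad + alpha * regGrad
      j += 1
    }
  }
}

The only extra state here is the length-numFeatures vector of sigmas; no rescaled copy of the data is needed.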

> Handling feature scaling properly for GLMs
> ------------------------------------------
>
>                 Key: SPARK-6683
>                 URL: https://issues.apache.org/jira/browse/SPARK-6683
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
> * improves optimization behavior (essentially always improves behavior in practice)
> * changes the optimal solution (often for the better, in terms of standardizing feature importance)
> Current problems:
> * Inefficient implementation: We make a rescaled copy of the data.
> * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries.  (Note: Feature scaling could be handled without changing the solution.)
> * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option.
> This is a proposal, discussed with [~mengxr], for an "ideal" solution.  It will require some breaking API changes, but I'd argue they are necessary for the long term, since this is the best API we have thought of.
> Proposal:
> * Implementation: Change to avoid making a rescaled copy of the data (described below).  No API issues here.
> * API:
> ** Hide featureScaling from the API. (breaking change)
> ** Internally, handle feature scaling to improve optimization, but modify it so that it does not change the optimal solution. (breaking change, in terms of algorithm behavior)
> ** Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step, e.g., with StandardScaler; see the sketch after this description.
> Details on implementation:
> * GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above), as in the squared-error sketch near the top of this message.  This would only require storing a vector of length numFeatures, rather than making a full rescaled copy of the data.
> * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in here.
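
As a user-side illustration of the "preprocessing step" in the proposal above, here is a minimal sketch using MLlib's existing StandardScaler.  The RDD name `data` and the choice of LinearRegressionWithSGD are assumptions for the example, not part of the proposal:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Users who want the rescaled *solution* standardize explicitly, instead of
// relying on a hidden featureScaling flag inside the algorithm.
// Assumes an existing RDD[LabeledPoint] named `data`.
val scaler = new StandardScaler(withMean = false, withStd = true)
  .fit(data.map(_.features))
val scaled = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
val model = LinearRegressionWithSGD.train(scaled, numIterations = 100)

With this split, whether standardization changes the solution becomes an explicit user choice, which addresses the R-comparison surprise described above.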





