spark-dev mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: [mllib] useFeatureScaling likes hardcode in LogisticRegressionWithLBFGS and is not comprehensive for users.
Date Wed, 26 Nov 2014 20:06:57 GMT
Hi Yanbo,

We scale the model coefficients back to the original feature space
after training, so no scaling is needed at prediction time.
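
To illustrate why back-scaling makes prediction-time scaling
unnecessary, here is a minimal sketch in plain Python (not MLlib code).
It assumes the scaling divides each feature by its standard deviation
without centering; the data and the weights in `scaled_weights` are
made up for illustration.

```python
import math

def std_devs(rows):
    """Per-feature sample standard deviation (n - 1 denominator)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(dims)
    ]

def margin(weights, features, intercept=0.0):
    """Linear margin w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + intercept

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up raw features on very different scales.
raw_rows = [[1000.0, 0.1], [2000.0, 0.4], [1500.0, 0.2], [3000.0, 0.9]]
sigma = std_devs(raw_rows)

# Hypothetical weights as they might come out of training on x / sigma.
scaled_weights = [0.8, -1.5]
intercept = 0.25

# Scale the coefficients back: w_j = w'_j / sigma_j.
original_weights = [w / s for w, s in zip(scaled_weights, sigma)]

for row in raw_rows:
    scaled_row = [x / s for x, s in zip(row, sigma)]
    p_scaled = sigmoid(margin(scaled_weights, scaled_row, intercept))
    p_raw = sigmoid(margin(original_weights, row, intercept))
    # Same probability whether we scale the input or the weights.
    assert abs(p_scaled - p_raw) < 1e-12
```

Because (w' / sigma) . x equals w' . (x / sigma), the back-scaled
weights can be applied directly to unscaled inputs.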

We had some discussion about this. I'd like to treat feature scaling
as part of the feature transformation and recommend that users apply
it before training. That seems like a cleaner solution to me, and it
is easy with the new pipeline API. DB (cc'ed) recommends embedding
feature scaling in linear methods because it generally leads to
better conditioning, which is also a valid point. Feel free to create
a JIRA and we can continue the discussion there.
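
As a rough sketch of the "scale as a preprocessing step" option, the
plain-Python example below fits the scaling statistics once on the
training data and reuses them for the prediction data. `UnitStdScaler`
is a hypothetical helper written for this example, not the pipeline API
or an MLlib class, and the data is made up.

```python
import math

class UnitStdScaler:
    """Hypothetical helper: divides each feature by its training-set std."""

    def fit(self, rows):
        n = len(rows)
        dims = len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(dims)]
        self.std = [
            math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(dims)
        ]
        return self

    def transform(self, rows):
        # Constant features (std == 0) are left unchanged to avoid
        # division by zero; this is a choice made for the sketch.
        return [
            [x / s if s > 0 else x for x, s in zip(r, self.std)]
            for r in rows
        ]

train_rows = [[1000.0, 0.1], [2000.0, 0.4], [1500.0, 0.2], [3000.0, 0.9]]
predict_rows = [[2500.0, 0.3]]

# Fit on the training set only, then apply the same transformation to
# both sets so training and prediction see consistently scaled inputs.
scaler = UnitStdScaler().fit(train_rows)
train_scaled = scaler.transform(train_rows)
predict_scaled = scaler.transform(predict_rows)
```

With this approach the model's weights live in the scaled space, so
the same fitted scaler has to be applied to every input at prediction
time.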

Best,
Xiangrui

On Wed, Nov 26, 2014 at 1:39 AM, Yanbo Liang <yanbohappy@gmail.com> wrote:
> Hi All,
>
> LogisticRegressionWithLBFGS sets useFeatureScaling to true by default,
> which can improve convergence during optimization.
> However, other training methods such as LogisticRegressionWithSGD do not
> set useFeatureScaling to true by default, and the corresponding setter
> is private to the mllib scope, so users cannot change it.
>
> The default configuration can cause a mismatch between training and
> prediction.
> Suppose users prepare the training and prediction data sets in the same
> format, then run model training with LogisticRegressionWithLBFGS followed
> by prediction.
> They may not know that the training step includes feature scaling while
> the prediction step does not.
> At prediction time, the model is then applied to a data set whose scale
> is not consistent with the training step.
>
> Should we make the setFeatureScaling function public and change its
> default value to false?
> I think it is clearer and easier to understand if feature scaling and
> normalization happen in the preprocessing step of the machine learning
> pipeline.
> If this proposal is OK, I will file a JIRA to track it.
