spark-dev mailing list archives

From Yanbo Liang
Subject [mllib] useFeatureScaling likes hardcode in LogisticRegressionWithLBFGS and is not comprehensive for users.
Date Wed, 26 Nov 2014 09:39:06 GMT
Hi All,

LogisticRegressionWithLBFGS sets useFeatureScaling to true by default, which
can improve convergence during optimization.
However, other training methods such as LogisticRegressionWithSGD do not set
useFeatureScaling to true by default, and the corresponding setter is private
to the mllib package, so users cannot change it.

This default can cause a mismatch between training and prediction.
Suppose users prepare the training set and the prediction set in the same
format, then train a model with LogisticRegressionWithLBFGS and run
prediction.
They may not know that the training step includes feature scaling while the
prediction step does not.
At prediction time, the model is then applied to a dataset whose feature
ranges are not consistent with those used in training.
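
The mismatch described above can be illustrated with a small sketch in plain
Python (not MLlib code). It assumes weights that were learned on standardized
features and are then applied to raw features at prediction time. The weights,
means, and standard deviations are hypothetical values chosen only for
illustration, and the helper names are not part of any Spark API. The sketch
shows what such a mismatch would look like; it does not verify how
LogisticRegressionWithLBFGS handles the learned weights internally.

```python
import math

def standardize(rows, means, stds):
    """Scale each feature column using the given per-column mean and std."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def predict_proba(weights, bias, row):
    """Logistic model: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, chosen only for illustration.
means, stds = [50.0, 0.5], [10.0, 0.1]  # column statistics from the training set
weights, bias = [1.0, -1.0], 0.0        # weights assumed learned on scaled features

raw_row = [60.0, 0.4]                   # a prediction-time row in raw units
scaled_row = standardize([raw_row], means, stds)[0]

p_scaled = predict_proba(weights, bias, scaled_row)  # same scaling as training
p_raw = predict_proba(weights, bias, raw_row)        # scaling step skipped

print(p_scaled, p_raw)
```

Under these assumptions the two probabilities differ substantially for the
same input row, which is the kind of inconsistency the paragraph above is
concerned about.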

Should we make the setFeatureScaling function public and change the default
value to false?
I think it is clearer and easier to understand if feature scaling and
normalization are done in the preprocessing step of the machine learning
pipeline.
If this proposal is OK, I will file a JIRA to track it.
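
As a rough sketch of what scaling in a preprocessing step could look like, the
plain-Python example below fits column statistics on the training set once
and reuses them for the prediction set. SimpleScaler is a hypothetical helper
written for this illustration and is not an MLlib class; in MLlib itself the
StandardScaler in mllib.feature would likely play this role.

```python
import math

class SimpleScaler:
    """Hypothetical helper: learns column mean/std once, then reuses them."""

    def fit(self, rows):
        n = len(rows)
        cols = list(zip(*rows))
        self.means = [sum(c) / n for c in cols]
        # Population standard deviation; constant columns fall back to 1.0.
        self.stds = [
            math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
            for c, m in zip(cols, self.means)
        ]
        return self

    def transform(self, rows):
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]

train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
test = [[2.0, 400.0]]

scaler = SimpleScaler().fit(train)      # statistics come from training data only
train_scaled = scaler.transform(train)  # would be passed to a trainer with scaling off
test_scaled = scaler.transform(test)    # same statistics reused at prediction time
```

Because both sets go through the same fitted scaler, the model sees features
on the same scale in training and prediction, and the scaling step is visible
to the user instead of hidden inside the optimizer.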
