spark-dev mailing list archives

From Valeriy Avanesov <acop...@gmail.com>
Subject Re: [MLLib] Logistic Regression and standardization
Date Sat, 28 Apr 2018 14:10:36 GMT
Hi Joseph,

I've just tried that out. MLlib does indeed return different models, so I
see no problem here. How, then, is Filipp's issue possible?
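
The comparison I mean is roughly the following (a sketch; "train" stands
for a DataFrame holding the same synthetic data with the usual
label/features columns, and regParam plays the role of glmnet's lambda):

  import org.apache.spark.ml.classification.LogisticRegression

  val lrStd   = new LogisticRegression().setRegParam(0.01).setStandardization(true)
  val lrNoStd = new LogisticRegression().setRegParam(0.01).setStandardization(false)

  // with a nonzero regParam the two coefficient vectors differ
  println(lrStd.fit(train).coefficients)
  println(lrNoStd.fit(train).coefficients)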

Best,

Valeriy.


On 04/27/2018 10:00 PM, Valeriy Avanesov wrote:
>
> Hi all,
>
> maybe I'm missing something, but from what was discussed here I've
> gathered that the current MLlib implementation returns exactly the
> same model whether standardization is turned on or off.
>
> I suggest considering the R script below, which trains two penalized
> logistic regression models (with glmnet), once with and once without
> standardization. The two models are clearly different.
>
> Therefore, the current MLlib implementation doesn't follow glmnet.
>
> library(glmnet)
> library(e1071)
>
> set.seed(13)
>
> # generate synthetic data
> X = cbind(-500:500, (-500:500)*1000)/1000
> y = sigmoid(X %*% c(1, 1))
> y = rbinom(length(y), 1, y)   # Bernoulli labels with success probabilities y
>
> # define two testing points
> xTest = rbind(c(-10, -10000)/1000, c(-20, -20000)/1000)
>
> # train two models: with and without standardization
> lambda = 0.01
>
> model = glmnet(X, y, family="binomial", standardize=TRUE, lambda=lambda)
> print(predict(model, xTest, type="link"))
>
> model = glmnet(X, y, family="binomial", standardize=FALSE, lambda=lambda)
> print(predict(model, xTest, type="link"))
>
> Best,
>
> Valeriy.
>
>
> On 04/25/2018 12:32 AM, DB Tsai wrote:
>> As one of the original authors, let me chime in with some comments.
>>
>> Without standardization, LBFGS will be unstable. For example, if a feature
>> is multiplied by 10, the corresponding coefficient should be divided by 10
>> to produce the same prediction. But without standardization, LBFGS may
>> converge to a different solution because of numerical instability.
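>>
>> For example (an illustrative Scala snippet, not MLlib code), the margin
>> is unchanged when a feature is scaled up by 10 and its coefficient is
>> scaled down by 10:
>>
>>   val x = Array(3.0, 5.0)                  // feature values
>>   val w = Array(2.0, -1.0)                 // coefficients
>>   val margin  = x.zip(w).map { case (xi, wi) => xi * wi }.sum
>>   val margin2 = x.map(_ * 10).zip(w.map(_ / 10)).map { case (xi, wi) => xi * wi }.sum
>>   assert(math.abs(margin - margin2) < 1e-12)   // same prediction either way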
>>
>> TL;DR: this can be implemented either in the optimizer or in the trainer.
>> We chose to implement it in the trainer because the LBFGS optimizer in
>> Breeze suffers from this issue. As a user, you don't need to care much
>> even if you have one-hot encoded features, and the result should match R.
>>
>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   
>> Apple, Inc
>>
>>> On Apr 20, 2018, at 5:56 PM, Weichen Xu <weichen.xu@databricks.com> wrote:
>>>
>>> Right. If the regularization term isn't zero, then enabling or disabling
>>> standardization will give different results.
>>> But when comparing results between R's glmnet and MLlib, if we set the
>>> same parameters for regularization/standardization/..., then we should
>>> get the same result. If not, then maybe there's a bug. In that case you
>>> can paste your testing code and I can help fix it.
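>>>
>>> For such a comparison, the parameter mapping is roughly the following
>>> (a sketch; treat the lambda <-> regParam correspondence as approximate,
>>> since it depends on how each library scales its objective):
>>>
>>>   // glmnet(X, y, family="binomial", alpha=1, lambda=0.01, standardize=TRUE)
>>>   // corresponds roughly to:
>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>   val lr = new LogisticRegression()
>>>     .setRegParam(0.01)            // glmnet's lambda
>>>     .setElasticNetParam(1.0)      // glmnet's alpha (glmnet defaults to 1 = lasso)
>>>     .setStandardization(true)     // glmnet's standardize
>>>     .setFitIntercept(true)        // glmnet fits an intercept by default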
>>>
>>> On Sat, Apr 21, 2018 at 1:06 AM, Valeriy Avanesov <acopich@gmail.com> wrote:
>>>
>>>     Hi all.
>>>
>>>     Filipp, do you use l1/l2/elastic-net penalization? I believe
>>>     standardization matters in that case.
>>>
>>>     Best,
>>>
>>>     Valeriy.
>>>
>>>
>>>     On 04/17/2018 11:40 AM, Weichen Xu wrote:
>>>>     Not a bug.
>>>>
>>>>     When standardization is disabled, MLlib's LR will still standardize
>>>>     the features internally, but it scales the coefficients back at the
>>>>     end (after training has finished), so it gets the same result as
>>>>     training without standardization. The purpose of this is to improve
>>>>     the rate of convergence. So the result should always be exactly the
>>>>     same as R's glmnet, whether standardization is enabled or disabled.
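>>>>
>>>>     Schematically the trainer does something like the following (a
>>>>     rough sketch of the idea, not the actual MLlib code;
>>>>     optimizeOnScaledFeatures and featuresStd are placeholders):
>>>>
>>>>       // 1. optimize the objective on features divided by their std
>>>>       val coefOnScaledData: Array[Double] = optimizeOnScaledFeatures()
>>>>       // 2. fold the scaling back in, so the returned coefficients
>>>>       //    are on the original feature scale
>>>>       val coef = coefOnScaledData.zip(featuresStd).map {
>>>>         case (b, s) => if (s != 0.0) b / s else 0.0
>>>>       }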
>>>>
>>>>     Thanks!
>>>>
>>>>     On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang
>>>>     <ybliang8@gmail.com> wrote:
>>>>
>>>>         Hi Filipp,
>>>>
>>>>         MLlib’s LR implementation handles standardization the same
>>>>         way as R’s glmnet.
>>>>         Actually you don’t need to care about the implementation
>>>>         detail: the coefficients are always returned on the original
>>>>         scale, so it should return the same result as other popular
>>>>         ML libraries.
>>>>         Could you point me to where glmnet doesn’t scale features?
>>>>         I suspect other issues caused your prediction quality to
>>>>         drop. If you can share the code and data, I can help check it.
>>>>
>>>>         Thanks
>>>>         Yanbo
>>>>
>>>>
>>>>>         On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin
>>>>>         <filipp.zhinkin@gmail.com> wrote:
>>>>>
>>>>>         Hi all,
>>>>>
>>>>>         While migrating from a custom LR implementation to MLlib's
>>>>>         LR implementation, my colleagues noticed that prediction
>>>>>         quality dropped (according to several business metrics).
>>>>>         It turned out that the issue is caused by the feature
>>>>>         standardization performed by MLlib's LR: regardless of the
>>>>>         'standardization' option's value, all features are scaled
>>>>>         during loss and gradient computation (as well as in a few
>>>>>         other places):
>>>>>         https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
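>>>>>
>>>>>         The pattern in question looks roughly like this (paraphrased,
>>>>>         not a verbatim copy of LogisticAggregator):
>>>>>
>>>>>           // the margin sums coefficient * (value / featuresStd(index))
>>>>>           // for every active feature, regardless of the
>>>>>           // 'standardization' setting
>>>>>           var sum = 0.0
>>>>>           features.foreachActive { (index, value) =>
>>>>>             if (featuresStd(index) != 0.0 && value != 0.0) {
>>>>>               sum += coefficients(index) * value / featuresStd(index)
>>>>>             }
>>>>>           }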
>>>>>
>>>>>         According to comments in the code, standardization should
>>>>>         be implemented the same way it is implemented in R's glmnet
>>>>>         package. I've looked through the corresponding Fortran
>>>>>         code, and it seems glmnet doesn't scale features when
>>>>>         standardization is disabled (but MLlib still does).
>>>>>
>>>>>         Our models contain multiple one-hot encoded features, and
>>>>>         scaling them is a pretty bad idea.
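>>>>>
>>>>>         For example, a dummy column that is 1 in about 1% of rows has a
>>>>>         standard deviation of roughly sqrt(0.01 * 0.99) ~ 0.1, so dividing
>>>>>         by it effectively multiplies that column by about 10 relative to
>>>>>         an unscaled feature, which changes how strongly the regularizer
>>>>>         affects it (back-of-the-envelope illustration):
>>>>>
>>>>>           val p = 0.01                       // fraction of rows where the dummy is 1
>>>>>           val std = math.sqrt(p * (1 - p))   // ~= 0.0995
>>>>>           println(1.0 / std)                 // ~= 10: effective up-scaling of the column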
>>>>>
>>>>>         Why does MLlib's LR always scale all features? From my
>>>>>         point of view it's a bug.
>>>>>
>>>>>         Thanks in advance,
>>>>>         Filipp.
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

