spark-user mailing list archives

From DB Tsai <dbt...@dbtsai.com>
Subject Re: FW: MLLIB (Spark) Question.
Date Wed, 17 Jun 2015 04:03:57 GMT
Hi Dhar,

For "standardization", we can disable it effectively by using
different regularization on each component. Thus, we're solving the
same problem but having better rate of convergence. This is one of the
features I will implement.
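The equivalence claimed above is easiest to check for ridge (L2-penalized least squares), where the closed form makes it explicit: a uniform penalty in the standardized space equals a per-component penalty of lambda * s_j^2 in the original space. A minimal numpy sketch with made-up data (scaling only, no mean-centering, and not Spark code):

```python
import numpy as np

rng = np.random.default_rng(0)
# columns with very different scales, like unstandardized real data
X = rng.normal(size=(200, 5)) * np.array([0.1, 1.0, 3.0, 10.0, 0.5])
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

lam = 0.1
s = X.std(axis=0)  # per-column scales

# (a) standardize, then apply a uniform L2 penalty
Xs = X / s
v = np.linalg.solve(Xs.T @ Xs + lam * np.eye(5), Xs.T @ y)

# (b) stay in the original space, but penalize component j by lam * s_j**2
w = np.linalg.solve(X.T @ X + lam * np.diag(s**2), X.T @ y)

# both describe the same model: v == s * w
assert np.allclose(v, s * w)
```

So adjusting the regularization per component solves the same problem as standardizing, while the standardized formulation typically converges faster for iterative solvers.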

Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Jun 16, 2015 at 8:34 PM, Dhar Sauptik (CR/RTC1.3-NA)
<Sauptik.Dhar@us.bosch.com> wrote:
> Hi DB,
>
> Thank you for the reply. The answers make sense. I have just one more point to add.
>
> Note that it may be better not to implicitly standardize the data. Agreed, a number
> of algorithms benefit from such standardization, but for many applications with
> contextual information such standardization "may" not be desirable.
> Users can always perform the standardization themselves.
>
> However, that's just a suggestion. Again, thank you for the clarification.
>
> Thanks,
> Sauptik.
>
>
> -----Original Message-----
> From: DB Tsai [mailto:dbtsai@dbtsai.com]
> Sent: Tuesday, June 16, 2015 2:49 PM
> To: Dhar Sauptik (CR/RTC1.3-NA); Ramakrishnan Naveen (CR/RTC1.3-NA)
> Cc: user@spark.apache.org
> Subject: Re: FW: MLLIB (Spark) Question.
>
> +cc user@spark.apache.org
>
> Reply inline.
>
> On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
> <Sauptik.Dhar> wrote:
>> Hi DB,
>>
>> Thank you for the reply. That explains a lot.
>>
>> I however had a few points regarding this:-
>>
>> 1. Just to help with the debate on not regularizing the b parameter: a standard
>> implementation argues against regularizing b. See p. 64, para. 1 of
>> http://statweb.stanford.edu/~tibs/ElemStatLearn/
>>
>
> Agreed. We were just worried it would change behavior, but we actually
> have a PR that changes the behavior to the standard one:
> https://github.com/apache/spark/pull/6386
>
>> 2. Further, does the regularization of b also apply to the SGD implementation?
>> Currently the SGD and LBFGS implementations give different results (and neither
>> matches the IRLS algorithm). Are SGD and LBFGS implemented for different loss
>> functions? Can you please share your thoughts on this?
>>
>
> In the SGD implementation, we don't "standardize" the dataset before
> training. As a result, columns with low standard deviation are
> penalized more, and columns with high standard deviation are penalized
> less. "Standardizing" also helps the rate of convergence. That is why
> most packages "standardize" the data implicitly, obtain the weights in
> the "standardized" space, and then transform them back to the original
> space, so the whole process is transparent to users.
>
> 1) LORWithSGD: no standardization, and penalizes the intercept.
> 2) LORWithLBFGS: with standardization, but penalizes the intercept.
> 3) New LOR implementation: with standardization, without penalizing
> the intercept.
>
> As a result, only the new implementation in Spark ML handles
> everything correctly. We have tests to verify that the results match
> R.
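The "train in standardized space, transform back" step described above is pure algebra and can be illustrated outside Spark. A numpy sketch with hypothetical data (plain batch gradient descent standing in for the real optimizer):

```python
import numpy as np

rng = np.random.default_rng(1)
# features with very different scales and nonzero means
X = rng.normal(size=(300, 3)) * np.array([0.2, 1.0, 5.0]) + np.array([0.5, 1.0, -1.0])
true_w = np.array([1.5, -2.0, 0.4])
y = (1 / (1 + np.exp(-(X @ true_w + 0.5))) > rng.uniform(size=300)).astype(float)

mu, s = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / s  # standardized copy, used only for training

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# unregularized batch gradient descent on the standardized data
w, b = np.zeros(3), 0.0
for _ in range(2000):
    p = sigmoid(Xs @ w + b)
    w -= 0.5 * Xs.T @ (p - y) / len(y)
    b -= 0.5 * (p - y).mean()

# map the solution back to the original feature space
w_orig = w / s
b_orig = b - (w * mu / s).sum()

# the user never sees the standardized space: predictions agree on raw X
assert np.allclose(sigmoid(Xs @ w + b), sigmoid(X @ w_orig + b_orig))
```

The final identity holds for any (w, b), which is exactly why packages can standardize internally without changing the model the user sees, as long as the regularization is accounted for consistently.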
>
>>
>> @Naveen: Please feel free to add/comment on the above points as you see necessary.
>>
>> Thanks,
>> Sauptik.
>>
>> -----Original Message-----
>> From: DB Tsai
>> Sent: Tuesday, June 16, 2015 2:08 PM
>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>> Cc: Dhar Sauptik (CR/RTC1.3-NA)
>> Subject: Re: FW: MLLIB (Spark) Question.
>>
>> Hey,
>>
>> In the LORWithLBFGS API you use, the intercept is regularized, while
>> the other implementations don't regularize the intercept. That's why
>> you see the difference.
>>
>> The intercept should not be regularized, so we fixed this in the new
>> Spark ML API in Spark 1.4. Since not regularizing the intercept would
>> change the behavior of the old API, we are still debating whether to
>> make the same change there.
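The practical effect of penalizing the intercept can be seen in a small sketch: plain numpy gradient descent on synthetic, imbalanced data (illustration only, not the MLlib code; the `penalize_intercept` flag is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
# imbalanced labels: the true model needs a large positive intercept
y = (1 / (1 + np.exp(-(X @ np.array([1.0, -1.0]) + 2.0))) > rng.uniform(size=400)).astype(float)

def fit(X, y, lam, penalize_intercept, iters=3000, lr=0.5):
    """L2-regularized logistic regression via batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n + lam * w)
        b -= lr * ((p - y).mean() + (lam * b if penalize_intercept else 0.0))
    return w, b

_, b_free = fit(X, y, lam=0.1, penalize_intercept=False)
_, b_pen  = fit(X, y, lam=0.1, penalize_intercept=True)

# the penalty drags the intercept toward zero, biasing the predicted base rate
assert abs(b_pen) < abs(b_free)
```

Shrinking the intercept toward zero has no statistical justification: it biases the model's baseline probability toward 0.5 regardless of the actual class balance, which is one reason the three implementations listed earlier disagree.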
>>
>> See the following code for a full running example in Spark 1.4:
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
>>
>> Also check out my talk at Spark Summit:
>> http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA)
>> <Naveen.Ramakrishnan> wrote:
>>> Hi DB,
>>>     Hope you are doing well! One of my colleagues, Sauptik, is working with
>>> MLlib's logistic regression based on LBFGS and is having trouble
>>> reproducing the same results as Matlab. Please see below for
>>> details. I did take a look into this, but it seems there's also a discrepancy
>>> between the logistic regression SGD and LBFGS implementations in MLlib.
>>> We have attached all the code for your analysis – it's in PySpark though.
>>> Let us know if you have any questions or concerns. We would very much
>>> appreciate your help whenever you get a chance.
>>>
>>> Best,
>>> Naveen.
>>>
>>> _____________________________________________
>>> From: Dhar Sauptik (CR/RTC1.3-NA)
>>> Sent: Thursday, June 11, 2015 6:03 PM
>>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>>> Subject: MLLIB (Spark) Question.
>>>
>>>
>>> Hi Naveen,
>>>
>>> I am writing this owing to some MLlib issues I found while using logistic
>>> regression. Basically, I am trying to test the stability of the L1/L2
>>> logistic regression using SGD and LBFGS. Unfortunately, I am unable to confirm
>>> the correctness of the algorithms. For comparison, I implemented the
>>> L2-regularized logistic regression algorithm (using the IRLS algorithm, p. 121) from
>>> the book http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
>>> . Unfortunately, the solutions don't match:
>>>
>>> For example:-
>>>
>>> Using the publicly available data (diabetes.csv) for L2-regularized logistic
>>> regression (with lambda = 0.1) we get,
>>>
>>> Solutions
>>>
>>> MATLAB CODE (IRLS):-
>>>
>>> w = 0.294293470805555
>>> 0.550681766045083
>>> 0.0396336870148899
>>> 0.0641285712055971
>>> 0.101238592147879
>>> 0.261153541551578
>>> 0.178686710290069
>>>
>>> b=  -0.347396594061553
>>>
>>>
>>> MLLIB (SGD):-
>>> (weights=[0.352873922589,0.420391294105,0.0100571908041,0.150724951988,0.238536959009,0.220329295188,0.269139932714],
>>> intercept=-0.00749988882664631)
>>>
>>>
>>> MLLIB(LBFGS):-
>>> (weights=[0.787850211605,1.964589985,-0.209348425939,0.0278848173986,0.12729017522,1.58954647312,0.692671824394],
>>> intercept=-0.027401869113912316)
>>>
>>>
>>> All the code is attached to the email.
>>>
>>> Apparently the solutions are quite far from the optimum (and even from
>>> each other)! Can you please check with DB Tsai on the reasons for such
>>> differences? Note that all the additional parameters are described in the
>>> source code.
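The IRLS reference above (ESL, p. 121) amounts to Newton's method on the penalized log-likelihood, solving a reweighted least-squares system at each step. A minimal numpy sketch on made-up data (not the attached diabetes.csv), with the intercept left unpenalized:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = (1 / (1 + np.exp(-(X @ np.array([0.5, -1.0, 0.3]) - 0.3))) > rng.uniform(size=200)).astype(float)

lam = 0.1
Xb = np.hstack([X, np.ones((len(X), 1))])  # append intercept column
R = lam * np.eye(Xb.shape[1])
R[-1, -1] = 0.0                             # do not penalize the intercept

beta = np.zeros(Xb.shape[1])
for _ in range(25):                         # Newton steps = iteratively reweighted LS
    p = 1 / (1 + np.exp(-(Xb @ beta)))
    W = p * (1 - p)                         # diagonal of the IRLS weight matrix
    grad = Xb.T @ (p - y) + R @ beta
    H = (Xb * W[:, None]).T @ Xb + R        # X^T W X + R
    beta = beta - np.linalg.solve(H, grad)

w, b = beta[:-1], beta[-1]
# at the optimum the penalized gradient vanishes
p = 1 / (1 + np.exp(-(Xb @ beta)))
assert np.allclose(Xb.T @ (p - y) + R @ beta, 0, atol=1e-6)
```

A baseline like this, checked against the optimality condition, is a useful yardstick when comparing solvers, since SGD and LBFGS should agree with it (up to tolerance) only when they optimize the same penalized objective.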
>>>
>>>
>>> Thanks,
>>> Best regards / Mit freundlichen Grüßen,
>>>
>>> Sauptik Dhar, Ph.D.
>>> CR/RTC1.3-NA
>>>
>>>
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

