Github user jrdi commented on a diff in the pull request:
https://github.com/apache/spark/pull/7080#discussion_r171688655
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -534,27 +559,43 @@ private class LogisticCostFun(
       case (aggregator1, aggregator2) => aggregator1.merge(aggregator2)
     })
 
-    // regVal is the sum of weight squares for L2 regularization
-    val norm = if (regParamL2 == 0.0) {
-      0.0
-    } else if (fitIntercept) {
-      brzNorm(Vectors.dense(weights.toArray.slice(0, weights.size - 1)).toBreeze, 2.0)
-    } else {
-      brzNorm(weights, 2.0)
-    }
-    val regVal = 0.5 * regParamL2 * norm * norm
+    val totalGradientArray = logisticAggregator.gradient.toArray
 
-    val loss = logisticAggregator.loss + regVal
-    val gradient = logisticAggregator.gradient
-
-    if (fitIntercept) {
-      val wArray = w.toArray.clone()
-      wArray(wArray.length - 1) = 0.0
-      axpy(regParamL2, Vectors.dense(wArray), gradient)
+    // regVal is the sum of weight squares excluding intercept for L2 regularization.
+    val regVal = if (regParamL2 == 0.0) {
+      0.0
     } else {
-      axpy(regParamL2, w, gradient)
+      var sum = 0.0
+      w.foreachActive { (index, value) =>
+        // If `fitIntercept` is true, the last term which is intercept doesn't
+        // contribute to the regularization.
+        if (index != numFeatures) {
+          // The following code will compute the loss of the regularization; also
+          // the gradient of the regularization, and add back to totalGradientArray.
+          sum += {
+            if (standardization) {
+              totalGradientArray(index) += regParamL2 * value
+              value * value
+            } else {
+              if (featuresStd(index) != 0.0) {
+                // If `standardization` is false, we still standardize the data
+                // to improve the rate of convergence; as a result, we have to
+                // perform this reverse standardization by penalizing each component
+                // differently to get effectively the same objective function when
+                // the training dataset is not standardized.
+                val temp = value / (featuresStd(index) * featuresStd(index))
+                totalGradientArray(index) += regParamL2 * temp
+                value * temp
--- End diff ---
@dbtsai I understand this change was implemented years ago, but I don't get why you are creating `temp` as `value / std^2`. Isn't it supposed to be standardized? Shouldn't it be `value / std`? In addition, `value * temp` is equivalent to `(value / std)^2`, which makes more sense to me.

Comparing it to the standardized branch, where we add `regParamL2 * standardizedValue` to the gradient and 'return' `standardizedValue^2`, here we are adding `regParamL2 * standardizedValue / std` to the gradient and returning `standardizedValue^2`.

Does that make sense? Am I missing some point here?
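
To make the comparison concrete, here is a tiny standalone sketch of the quantities in both branches (plain Scala; the names `std`, `value`, and `RegBranchCheck` are mine for illustration, not from the patch):

```scala
object RegBranchCheck extends App {
  // Hypothetical single-coefficient example; names are mine, not from the patch.
  val regParamL2 = 0.1
  val std = 2.0    // plays the role of featuresStd(index)
  val value = 3.0  // plays the role of w(index), the coefficient the optimizer sees

  // standardization = true branch:
  val gradTermStd = regParamL2 * value  // added to totalGradientArray(index)
  val lossTermStd = value * value       // contributed to `sum`

  // standardization = false branch:
  val temp = value / (std * std)
  val gradTermNoStd = regParamL2 * temp // == regParamL2 * (value / std) / std
  val lossTermNoStd = value * temp      // == (value / std)^2

  // The loss term is indeed (value / std)^2, as noted above:
  assert(math.abs(lossTermNoStd - math.pow(value / std, 2)) < 1e-12)
  // while the gradient term carries an extra 1 / std relative to
  // regParamL2 * (value / std):
  assert(math.abs(gradTermNoStd - regParamL2 * (value / std) / std) < 1e-12)

  println(s"grad: $gradTermStd vs $gradTermNoStd") // 0.3 vs 0.075
  println(s"loss: $lossTermStd vs $lossTermNoStd") // 9.0 vs 2.25
}
```

If the extra `1 / std` on the gradient side is intentional, it would be consistent with the derivative being taken with respect to the coefficient in the standardized space rather than the original one (d/dw of `0.5 * (w / std)^2` is `w / std^2`), but I may be misreading it.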


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
