spark-issues mailing list archives

From "Wayne Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM
Date Mon, 05 Dec 2016 06:43:59 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wayne Zhang updated SPARK-18715:
--------------------------------
    Summary: Fix wrong AIC calculation in Binomial GLM  (was: Correct AIC calculation in Binomial GLM)

> Fix wrong AIC calculation in Binomial GLM
> -----------------------------------------
>
>                 Key: SPARK-18715
>                 URL: https://issues.apache.org/jira/browse/SPARK-18715
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.2
>            Reporter: Wayne Zhang
>            Priority: Critical
>              Labels: patch
>             Fix For: 2.2.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The AIC calculation in the Binomial GLM appears to be wrong when observation weights are
> present. The weight adjustment should be applied only to the part of the Binomial density
> that involves the parameters, not to the normalizing constant.
> The current implementation is:
> {code}
>       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>         weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
>       }.sum()
> {code} 
> I suggest changing this to:
> {code}
>       -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>         val wt = math.round(weight).toInt
>         if (wt == 0){
>           0.0
>         } else {
>           dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>         }
>       }.sum()
> {code} 
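To make the difference between the two snippets concrete, here is a minimal, self-contained sketch of the per-observation term each one computes. It is an illustration under stated assumptions, not Spark's code: `logBinomialPmf` is a hand-written stand-in for Breeze's `dist.Binomial(n, p).logProbabilityOf(k)`, the object and method names are hypothetical, and the value of `mu` is made up for the example rather than taken from a fitted model.

```scala
object BinomialAicTerms {
  // log of the binomial coefficient C(n, k), computed as a sum of logs.
  // For k = 0 the range is empty and the result is 0.0.
  def logChoose(n: Int, k: Int): Double =
    (1 to k).map(i => math.log(n - k + i) - math.log(i)).sum

  // Hand-written Binomial(n, p) log-pmf at k (stand-in for Breeze; assumes 0 < p < 1).
  def logBinomialPmf(n: Int, p: Double, k: Int): Double =
    logChoose(n, k) + k * math.log(p) + (n - k) * math.log(1.0 - p)

  // Term used by the current implementation: weight times a Bernoulli log-pmf
  // evaluated at the rounded response.
  def currentTerm(y: Double, mu: Double, weight: Double): Double =
    weight * logBinomialPmf(1, mu, math.round(y).toInt)

  // Term used by the proposed change: Binomial(round(weight), mu) log-pmf
  // evaluated at round(y * weight), which includes the normalizing constant once.
  def proposedTerm(y: Double, mu: Double, weight: Double): Double = {
    val wt = math.round(weight).toInt
    if (wt == 0) 0.0
    else logBinomialPmf(wt, mu, math.round(y * weight).toInt)
  }

  def main(args: Array[String]): Unit = {
    // y and weight match the second row of the example data; mu = 0.4 is hypothetical.
    val (y, mu, weight) = (0.5, 0.4, 1.5)
    println(s"current:  ${currentTerm(y, mu, weight)}")   // equals 1.5 * log(0.4)
    println(s"proposed: ${proposedTerm(y, mu, weight)}")  // equals log(2 * 0.4 * 0.6)
  }
}
```

For this row the current term rounds `y = 0.5` up to 1 and scales a Bernoulli log-probability by 1.5, while the proposed term treats the row as 1 success out of 2 trials and adds `log C(2, 1)` unscaled, which is why the two sums disagree.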
> ----
> The following is an example to illustrate the problem.
> {code}
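> import org.apache.spark.ml.feature.LabeledPoint
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.regression.GeneralizedLinearRegression
> import org.apache.spark.sql.functions.col
> // toDF() relies on the SparkSession implicits (already in scope in spark-shell)
> val dataset = Seq(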
>       LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>       LabeledPoint(0.5, Vectors.dense(12, 0.0)),
>       LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>       LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>       LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>       LabeledPoint(0.5, Vectors.dense(16, 1.0))
>     ).toDF().withColumn("weight", col("label") + 1.0)
> val glr = new GeneralizedLinearRegression()
>     .setFamily("binomial")
>     .setWeightCol("weight")
>     .setRegParam(0)
> val model = glr.fit(dataset)
> model.summary.aic
> {code}
> This calculation gives an AIC of 14.189026847171382. To verify whether this is correct,
> I ran the same analysis in R and got AIC = 11.66092 and -2 * LogLik = 5.660918.
> {code}
> da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
> 0,18,1,1
> 0.5,12,0,1.5
> 1,15,0,2
> 0,13,2,1
> 0,15,1,1
> 0.5,16,1,1.5
> da <- as.data.frame(da)
> f <- glm(y ~ x1 + x2, data = da, family = binomial(), weights = w)
> AIC(f)
> -2 * logLik(f)
> {code}
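As a quick consistency check on the two R figures, assume R's usual definition of AIC with k equal to the number of estimated coefficients (here taken to be 3: the intercept, x1, and x2, with no estimated dispersion parameter for the binomial family):

```latex
\mathrm{AIC} = -2\,\log L + 2k = 5.660918 + 2 \cdot 3 = 11.660918 \approx 11.66092
```

Under that assumption the two R values are mutually consistent, so the disagreement with Spark lies in the log-likelihood term rather than in the penalty.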
> Next, I check whether the proposed change is correct. The following computes -2 * LogLik
> manually and gets 5.6609177228379055, which matches the R result.
> {code}
> import breeze.stats.{distributions => dist}
> import org.apache.spark.sql.Row
>
> val predictions = model.transform(dataset)
> -2.0 * predictions.select("label", "prediction", "weight").rdd.map {
>   case Row(y: Double, mu: Double, weight: Double) =>
>       val wt = math.round(weight).toInt
>       if (wt == 0){
>         0.0
>       } else {
>         dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>       }
>   }.sum()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

