From: "Wayne Zhang (JIRA)"
To: issues@spark.apache.org
Date: Mon, 5 Dec 2016 06:43:59 +0000 (UTC)
Subject: [jira] [Updated] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM

     [ https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wayne Zhang updated SPARK-18715:
--------------------------------
    Summary: Fix wrong AIC calculation in Binomial GLM  (was: Correct AIC calculation in Binomial GLM)

> Fix wrong AIC calculation in Binomial GLM
> -----------------------------------------
>
>                 Key: SPARK-18715
>                 URL: https://issues.apache.org/jira/browse/SPARK-18715
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.2
>            Reporter: Wayne Zhang
>            Priority: Critical
>              Labels: patch
>             Fix For: 2.2.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The AIC calculation in the Binomial GLM appears to be wrong when weights are present. The weight adjustment should be applied only to the part of the Binomial density that involves the parameters, not to the normalizing constant.
> The current implementation is:
> {code}
> -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>   weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
> }.sum()
> {code}
> I suggest changing this to:
> {code}
> -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
>   // Treat the rounded weight as the number of trials.
>   val wt = math.round(weight).toInt
>   if (wt == 0) {
>     0.0
>   } else {
>     dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>   }
> }.sum()
> {code}
> ----
> The following example illustrates the problem.
> {code}
> // Assumes a SparkSession named `spark`, as in spark-shell.
> import org.apache.spark.ml.feature.LabeledPoint
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.regression.GeneralizedLinearRegression
> import org.apache.spark.sql.functions.col
> import spark.implicits._
>
> val dataset = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(12, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(16, 1.0))
> ).toDF().withColumn("weight", col("label") + 1.0)
> val glr = new GeneralizedLinearRegression()
>   .setFamily("binomial")
>   .setWeightCol("weight")
>   .setRegParam(0)
> val model = glr.fit(dataset)
> model.summary.aic
> {code}
> This calculation reports an AIC of 14.189026847171382. To verify whether this is correct, I ran the same analysis in R, which gives AIC = 11.66092 and -2 * LogLik = 5.660918.
> {code}
> da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
> 0,18,1,1
> 0.5,12,0,1.5
> 1,15,0,2
> 0,13,2,1
> 0,15,1,1
> 0.5,16,1,1.5
> da <- as.data.frame(da)
> f <- glm(y ~ x1 + x2, data = da, family = binomial(), weights = w)
> AIC(f)
> -2 * logLik(f)
> {code}
> Next, I check whether the proposed change is correct. The following computes -2 * LogLik manually and gives 5.6609177228379055, the same value as R.
> {code}
> import breeze.stats.{distributions => dist}
> import org.apache.spark.sql.Row
>
> val predictions = model.transform(dataset)
> -2.0 * predictions.select("label", "prediction", "weight").rdd.map {
>   case Row(y: Double, mu: Double, weight: Double) =>
>     // Treat the rounded weight as the number of trials.
>     val wt = math.round(weight).toInt
>     if (wt == 0) {
>       0.0
>     } else {
>       dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>     }
> }.sum()
> {code}
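
The difference between the two quoted snippets can be written out term by term. This is only a sketch under assumed notation that does not appear in the report: y_i is the observed proportion, w_i the weight read as a number of trials, mu_i the fitted mean, and p the number of estimated coefficients. The first line is the Binomial log density for integer w_i and w_i y_i; the next two lines are my reading of what the current and proposed snippets sum; the last line is the standard AIC definition, of which the snippets compute only the -2 * LogLik term.

```latex
% Assumed notation (not in the original report):
%   y_i   observed proportion       w_i  weight, read as number of trials
%   \mu_i fitted mean               p    number of estimated coefficients
\begin{align*}
\log f(y_i;\mu_i,w_i)
  &= \log\binom{w_i}{w_i y_i}
     + w_i\bigl[\,y_i\log\mu_i + (1-y_i)\log(1-\mu_i)\bigr] \\
\ell_{\text{current}}
  &= \sum_i w_i\bigl[\,\tilde y_i\log\mu_i + (1-\tilde y_i)\log(1-\mu_i)\bigr],
     \qquad \tilde y_i = \operatorname{round}(y_i) \\
\ell_{\text{proposed}}
  &= \sum_{i:\,n_i>0}\Bigl[\log\binom{n_i}{k_i}
     + k_i\log\mu_i + (n_i-k_i)\log(1-\mu_i)\Bigr],
     \qquad n_i=\operatorname{round}(w_i),\; k_i=\operatorname{round}(w_i y_i) \\
\mathrm{AIC} &= -2\,\ell + 2p
\end{align*}
```

With Binomial(1, mu) the binomial coefficient is always 1, so the current form has no normalizing term and rounds each proportion to 0 or 1 before weighting. The proposed form keeps the coefficient and applies the weight through the trial count. The R figures quoted above agree with the last line: 11.66092 - 5.660918 = 6.000002, which is approximately 2p for p = 3 (intercept, x1 and x2).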