Date: Sun, 23 Aug 2015 04:34:45 +0000 (UTC)
From: "Apache Spark (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Assigned] (SPARK-10169) Evaluating AggregateFunction1 (old code path) may return wrong answers when grouping expressions are used as arguments of aggregate functions

[ https://issues.apache.org/jira/browse/SPARK-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10169:
------------------------------------

    Assignee: Yin Huai  (was: Apache Spark)

> Evaluating AggregateFunction1 (old code path) may return wrong answers when grouping expressions are used as arguments of aggregate functions
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10169
>                 URL: https://issues.apache.org/jira/browse/SPARK-10169
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.1, 1.2.2, 1.3.1, 1.4.1
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Critical
>
> Before Spark 1.5, if an aggregate function uses a grouping expression as an input argument, the result of the query can be wrong. The reason is that we use transformUp when rewriting aggregate results (see https://github.com/apache/spark/blob/branch-1.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L154).
> To reproduce the problem, you can use
> {code}
> import org.apache.spark.sql.functions._
> sc.parallelize((1 to 1000), 50).map(i => Tuple1(i)).toDF("i").registerTempTable("t")
> sqlContext.sql("""
> select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
> from t
> where i % 10 = 5
> group by i % 10""").explain()
> == Physical Plan ==
> Aggregate false, [PartialGroup#234], [PartialGroup#234 AS _c0#225,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((PartialGroup#234 = 5),1,0), LongType)) AS _c1#226L,Coalesce(SUM(PartialCount#233L),0) AS _c2#227L]
>  Exchange (HashPartitioning [PartialGroup#234], 200)
>   Aggregate true, [(i#191 % 10)], [(i#191 % 10) AS PartialGroup#234,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf(((i#191 % 10) = 5),1,0), LongType)) AS PartialSum#232L,COUNT(1) AS PartialCount#233L]
>    Project [_1#190 AS i#191]
>     Filter ((_1#190 % 10) = 5)
>      PhysicalRDD [_1#190], MapPartitionsRDD[93] at mapPartitions at ExistingRDD.scala:37
> sqlContext.sql("""
> select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
> from t
> where i % 10 = 5
> group by i % 10""").show
> _c0 _c1 _c2
> 5   50  100
> {code}
> In Spark 1.5, the new aggregation code path does not have this problem. The old code path is fixed by https://github.com/apache/spark/commit/dd9ae7945ab65d353ed2b113e0c1a00a0533ffd6.
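
One way to read the numbers in the reproduction: in the physical plan, the final Aggregate computes _c1 by re-evaluating if(PartialGroup = 5, 1, 0) on each partial row instead of summing PartialSum, so it appears to count partial rows rather than input rows. The sketch below models that with plain Scala collections and no Spark. It assumes that parallelize places 20 consecutive values in each of the 50 partitions; the names PartialAggregationSketch, PartialRow, correctSum, buggySum and count are hypothetical and are not part of Spark.

```scala
// Plain-Scala model of the two-phase aggregation in the plan (illustration only, not Spark code).
object PartialAggregationSketch {
  // One row per (partition, group): the output of the partial "Aggregate true" step.
  final case class PartialRow(partialGroup: Int, partialSum: Long, partialCount: Long)

  // Assumption: 1 to 1000 is split into 50 partitions of 20 consecutive values each.
  def partialRows: Seq[PartialRow] =
    (1 to 1000).grouped(20).toSeq.flatMap { partition =>
      partition
        .filter(_ % 10 == 5)   // where i % 10 = 5
        .groupBy(_ % 10)       // group by i % 10
        .map { case (group, rows) =>
          val partialSum = rows.map(i => if (i % 10 == 5) 1L else 0L).sum
          PartialRow(group, partialSum, rows.size.toLong)
        }
    }

  // Intended final step: merge the partial sums.
  def correctSum: Long = partialRows.map(_.partialSum).sum

  // What the final Aggregate in the plan does: evaluate the if once per partial row.
  def buggySum: Long = partialRows.map(r => if (r.partialGroup == 5) 1L else 0L).sum

  // count(i) is merged from PartialCount in the plan, so it is unaffected.
  def count: Long = partialRows.map(_.partialCount).sum

  def main(args: Array[String]): Unit =
    println(s"correct sum = $correctSum, buggy sum = $buggySum, count = $count")
}
```

Under these assumptions the model gives a buggy sum of 50 and a count of 100, matching the _c1 and _c2 values shown by show, while merging the partial sums gives 100.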