spark-issues mailing list archives

From "Franck Tago (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition
Date Tue, 04 Oct 2016 00:48:21 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543926#comment-15543926 ]

Franck Tago commented on SPARK-17758:
-------------------------------------

Is there any workaround that you can think of?

One simple solution could be to call repartition on the data frame in order to remove the
empty partition, but performance-wise that is just terrible.

I am not sure I understand why empty partitions are even allowed in the
Spark ecosystem. Any documentation on why empty partitions are allowed would be greatly
appreciated.
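
On the question of how an empty partition can arise at all: a partition is a slot that rows are assigned to, and nothing guarantees that every slot receives a row. The sketch below is a toy model in plain Python, not Spark's partitioner and not the Hive input splits from the report; `partition_by_key` and the integer keys are hypothetical, chosen so that the sizes match the 2/1/0 split described in the issue.

```python
def partition_by_key(keys, num_partitions):
    """Assign each integer key to partition `key % num_partitions`."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[key % num_partitions].append(key)
    return partitions


# Hypothetical keys: 0 and 3 land in partition 0, 1 lands in partition 1,
# and no key maps to partition 2, so that partition stays empty.
partitions = partition_by_key([0, 1, 3], 3)
sizes = [len(p) for p in partitions]
```

Here `sizes` holds one row count per partition, and the last partition receives no rows even though the input is not empty.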

> Spark Aggregate function  LAST returns null on an empty partition 
> ------------------------------------------------------------------
>
>                 Key: SPARK-17758
>                 URL: https://issues.apache.org/jira/browse/SPARK-17758
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>         Environment: Spark 2.0.0
>            Reporter: Franck Tago
>
> My Environment
> Spark 2.0.0
> I have included the physical plan of my application below.
> Issue description
> The results from a query that uses the LAST function are incorrect.
> The output obtained for the column that corresponds to the LAST function is null.
> My input data contains 3 rows.
> The application resulted in 2 stages.
> The first stage consisted of 3 tasks.
> The first task/partition contains 2 rows.
> The second task/partition contains 1 row.
> The last task/partition contains 0 rows.
> The result of the query for the LAST column is NULL, which I believe is due to the PARTIAL_LAST on the last partition.
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty partition should not return null.
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>    +- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], output=[max(C3_0)#50,last(C3_1)#51])
>       +- SortAggregate(key=[], functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], output=[max#91,last#92])
>          +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 AS C3_1#41]
>             +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS DOUBLE)#27,last(C1_1)#28])
>                +- Exchange SinglePartition
>                   +- SortAggregate(key=[], functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, false)], output=[sum#95L,last#96])
>                      +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
>                         +- HiveTableScan [field1#7, field#6], MetastoreRelation default, bdm_3449_src, alias
> {noformat}
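
The mechanism suspected in the description can be illustrated with a toy model in plain Python. This is not Spark's implementation: `partial_last`, `merge_naive`, `merge_guarded`, the `value_set` flag, and the row values are all hypothetical. The sketch assumes that each partition produces a `(value, value_set)` buffer and that the final step merges those buffers in partition order.

```python
def partial_last(rows):
    """Partial aggregate for one partition: returns (value, value_set)."""
    if not rows:
        return (None, False)  # empty partition: no row was observed
    return (rows[-1], True)


def merge_naive(buffers):
    """Keep the value of whichever buffer comes last, even an empty one."""
    value = None
    for buf_value, _ in buffers:
        value = buf_value
    return value


def merge_guarded(buffers):
    """Let only buffers that actually saw a row overwrite the result."""
    value = None
    for buf_value, value_set in buffers:
        if value_set:
            value = buf_value
    return value


# Hypothetical data mirroring the 2/1/0 row split from the report.
partitions = [["a", "b"], ["c"], []]
buffers = [partial_last(p) for p in partitions]

naive_result = merge_naive(buffers)
guarded_result = merge_guarded(buffers)
```

Under these assumptions the naive merge lets the empty third partition overwrite the value from the second one, so `naive_result` is `None`, while the guarded merge skips buffers that saw no rows and `guarded_result` is `"c"`.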



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

