spark-issues mailing list archives

From "Mitesh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
Date Thu, 18 May 2017 13:49:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ]

Mitesh edited comment on SPARK-17867 at 5/18/17 1:48 PM:
---------------------------------------------------------

I'm seeing a regression from this change: the last {{del <> 'hi'}} filter gets pushed
down past the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:scala}
    import org.apache.spark.sql.execution.{FilterExec, LocalTableScanExec}
    import org.apache.spark.sql.functions.{asc, col, desc}
    import spark.implicits._

    val df = Seq((1, 2, 3, "hi"), (1, 2, 4, "hi"))
      .toDF("userid", "eventid", "vk", "del")
      .filter("userid is not null and eventid is not null and vk is not null")
      .repartition(col("userid"))
      .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
      .dropDuplicates("eventid")
      .filter("userid is not null")
      .repartition(col("userid"))
      .sortWithinPartitions(asc("userid"))
      .filter("del <> 'hi'")

    // The del filter should not be pushed down past the dropDuplicates
    // aggregation to the local table scan.
    df.queryExecution.sparkPlan.collect {
      case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
        assert(false, s"$f was pushed down to $t")
    }
{code}
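
For reference, a quick way to see where the {{del <> 'hi'}} predicate ends up is to print the plans; a minimal sketch, assuming the same {{df}} as in the snippet above:

{code:scala}
// Print the parsed, analyzed, optimized, and physical plans. In the regressed
// case the Filter on del shows up below the aggregation that implements
// dropDuplicates, next to the LocalTableScan.
df.explain(extended = true)

// Or inspect just the physical plan programmatically.
println(df.queryExecution.sparkPlan.treeString)
{code}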



> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-17867
>                 URL: https://issues.apache.org/jira/browse/SPARK-17867
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>             Fix For: 2.1.0
>
>
> We find and get the first resolved attribute from the output with the given column name in
> Dataset.dropDuplicates. When there is more than one column with the same name, the other
> columns are put into the aggregation columns instead of the grouping columns. We should fix this.
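
For illustration, a minimal sketch (not from the ticket) of how a Dataset ends up with two columns sharing a name, which is the case {{dropDuplicates}} has to handle here; the tables and values are made up, and {{spark.implicits._}} is assumed to be in scope:

{code:scala}
import spark.implicits._

// Two small tables that share the key column name "id"; values are made up.
val left  = Seq((1, "a"), (1, "b")).toDF("id", "lval")
val right = Seq((1, "x")).toDF("id", "rval")

// The join result carries two columns literally named "id".
val joined = left.join(right, left("id") === right("id"))
// joined.columns == Array("id", "lval", "id", "rval")

// dropDuplicates("id") should treat both same-named columns as grouping
// columns; before this fix, only the first resolved "id" was grouped on.
joined.dropDuplicates("id").show()
{code}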





