drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2488) Wrong result on join between two subqueries with aggregation
Date Wed, 18 Mar 2015 00:17:39 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366370#comment-14366370
] 

Aman Sinha commented on DRILL-2488:
-----------------------------------

Adding a simplified query that does not do COUNT(distinct) and no GROUP-BY and still manifests
the problem: 
 {code}
select * from
  ( select b1 from dfs.`/Users/asinha/data/t1.parquet` order by b1 limit 5 offset 3) as sq1(x1)
     inner join 
  ( select b1 from dfs.`/Users/asinha/data/t1.parquet` order by b1 limit 5 offset 3) as sq2(x1)
      on sq1.x1 = sq2.x1
 ;
{code}
With HashJoin plan, this produces 5 rows (correct).  
With MergeJoin plan, this produces 2 rows (wrong).   However, I don't think this is an issue
with MergeJoin; it seems to be related to OFFSET.  Removing the offset produces correct results
and changing it produces different wrong results.  I will investigate some more. 

> Wrong result on join between two subqueries with aggregation
> ------------------------------------------------------------
>
>                 Key: DRILL-2488
>                 URL: https://issues.apache.org/jira/browse/DRILL-2488
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 0.8.0
>            Reporter: Victoria Markman
>            Assignee: Chris Westin
>            Priority: Critical
>             Fix For: 0.9.0
>
>         Attachments: t1.parquet
>
>
> {code}
> 0: jdbc:drill:schema=dfs> select * from t1;
> +------------+------------+------------+
> |     a1     |     b1     |     c1     |
> +------------+------------+------------+
> | 1          | aaaaa      | 2015-01-01 |
> | 2          | bbbbb      | 2015-01-02 |
> | 3          | ccccc      | 2015-01-03 |
> | 4          | null       | 2015-01-04 |
> | 5          | eeeee      | 2015-01-05 |
> | 6          | fffff      | 2015-01-06 |
> | 7          | ggggg      | 2015-01-07 |
> | null       | hhhhh      | 2015-01-08 |
> | 9          | iiiii      | null       |
> | 10         | jjjjj      | 2015-01-10 |
> +------------+------------+------------+
> 10 rows selected (0.15 seconds)
> {code}
> This result is incorrect, one row is missing
> {code}
> 0: jdbc:drill:schema=dfs> select * from
> . . . . . . . . . . . . > (
> . . . . . . . . . . . . >                 select
> . . . . . . . . . . . . >                         b1,
> . . . . . . . . . . . . >                         count(distinct a1)
> . . . . . . . . . . . . >                 from
> . . . . . . . . . . . . >                         t1
> . . . . . . . . . . . . >                 group by
> . . . . . . . . . . . . >                         b1
> . . . . . . . . . . . . >                 order by
> . . . . . . . . . . . . >                         b1 limit 5 offset 1
> . . . . . . . . . . . . >         ) as sq1(x1, y1)
> . . . . . . . . . . . . >
> . . . . . . . . . . . . >         inner join
> . . . . . . . . . . . . >
> . . . . . . . . . . . . >         (
> . . . . . . . . . . . . >                 select
> . . . . . . . . . . . . >                         b1,
> . . . . . . . . . . . . >                         count(distinct a1)
> . . . . . . . . . . . . >                 from
> . . . . . . . . . . . . >                         t1
> . . . . . . . . . . . . >                 group by
> . . . . . . . . . . . . >                         b1
> . . . . . . . . . . . . >                 order by
> . . . . . . . . . . . . >                         b1 limit 5 offset 1
> . . . . . . . . . . . . >         ) as sq2(x1, y1)
> . . . . . . . . . . . . >         on
> . . . . . . . . . . . . >                 sq1.x1 = sq2.x1 and
> . . . . . . . . . . . . >                 sq2.y1 = sq2.y1
> . . . . . . . . . . . . > ;
> +------------+------------+------------+------------+
> |     x1     |     y1     |    x10     |    y10     |
> +------------+------------+------------+------------+
> | bbbbb      | 1          | bbbbb      | 1          |
> | ccccc      | 1          | ccccc      | 1          |
> | eeeee      | 1          | eeeee      | 1          |
> | fffff      | 1          | fffff      | 1          |
> +------------+------------+------------+------------+
> 4 rows selected (0.28 seconds)
> {code}
> Explain plan for the wrong result:
> {code}
> 00-01      Project(x1=[$0], y1=[$1], x10=[$2], y10=[$3])
> 00-02        Project(x1=[$0], y1=[$1], x10=[$2], y10=[$3])
> 00-03          MergeJoin(condition=[=($0, $2)], joinType=[inner])
> 00-05            Limit(offset=[1], fetch=[5])
> 00-07              StreamAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 00-09                Sort(sort0=[$0], dir0=[ASC])
> 00-11                  StreamAgg(group=[{0, 1}])
> 00-13                    Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 00-15                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=maprfs:/drill/testdata/aggregation/t1]], selectionRoot=/drill/testdata/aggregation/t1,
numFiles=1, columns=[`b1`, `a1`]]])
> 00-04            Project(b10=[$0], EXPR$10=[$1])
> 00-06              SelectionVectorRemover
> 00-08                Sort(sort0=[$0], dir0=[ASC])
> 00-10                  Filter(condition=[=($1, $1)])
> 00-12                    Limit(offset=[1], fetch=[5])
> 00-14                      StreamAgg(group=[{0}], EXPR$1=[COUNT($1)])
> 00-16                        Sort(sort0=[$0], dir0=[ASC])
> 00-17                          StreamAgg(group=[{0, 1}])
> 00-18                            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 00-19                              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=maprfs:/drill/testdata/aggregation/t1]], selectionRoot=/drill/testdata/aggregation/t1,
numFiles=1, columns=[`b1`, `a1`]]])
> {code}
> If you turn off merge join, query returns correct result:
> {code}
> 0: jdbc:drill:schema=dfs> select * from
> . . . . . . . . . . . . > (
> . . . . . . . . . . . . >     select
> . . . . . . . . . . . . >         b1,
> . . . . . . . . . . . . >         count(distinct a1)
> . . . . . . . . . . . . >     from
> . . . . . . . . . . . . >         t1
> . . . . . . . . . . . . >     group by
> . . . . . . . . . . . . >         b1
> . . . . . . . . . . . . >     order by
> . . . . . . . . . . . . >         b1 limit 5 offset 1
> . . . . . . . . . . . . >  ) as sq1(x1, y1)
> . . . . . . . . . . . . >
> . . . . . . . . . . . . >         inner join
> . . . . . . . . . . . . > (
> . . . . . . . . . . . . >     select
> . . . . . . . . . . . . >         b1,
> . . . . . . . . . . . . >         count(distinct a1)
> . . . . . . . . . . . . >     from
> . . . . . . . . . . . . >         t1
> . . . . . . . . . . . . >     group by
> . . . . . . . . . . . . >         b1
> . . . . . . . . . . . . >     order by
> . . . . . . . . . . . . >           b1 limit 5 offset 1
> . . . . . . . . . . . . > ) as sq2(x1, y1)
> . . . . . . . . . . . . > on
> . . . . . . . . . . . . >                 sq1.x1 = sq2.x1 and
> . . . . . . . . . . . . >                 sq2.y1 = sq2.y1
> . . . . . . . . . . . . > ;
> +------------+------------+------------+------------+
> |     x1     |     y1     |    x10     |    y10     |
> +------------+------------+------------+------------+
> | bbbbb      | 1          | bbbbb      | 1          |
> | ccccc      | 1          | ccccc      | 1          |
> | eeeee      | 1          | eeeee      | 1          |
> | fffff      | 1          | fffff      | 1          |
> | ggggg      | 1          | ggggg      | 1          |
> +------------+------------+------------+------------+
> 5 rows selected (0.352 seconds)
> {code}
> cut/paste reproduction
> {code:sql}
> select * from
>  (
>                  select 
>                          b1,                      
>                          count(distinct a1)       
>                  from 
>                          t1                       
>                  group by                 
>                          b1                       
>                  order by                 
>                          b1 limit 5 offset 1      
>          ) as sq1(x1, y1)         
>   
>          inner join 
>   
>          (
>                  select 
>                          b1,                      
>                          count(distinct a1)       
>                  from 
>                          t1                       
>                  group by                 
>                          b1                       
>                  order by                 
>                          b1 limit 5 offset 1      
>          ) as sq2(x1, y1)         
>          on 
>                  sq1.x1 = sq2.x1 and      
>                  sq2.y1 = sq2.y1          
>  ;
> {code}
> This test has been running and passing in regression test suite until framework was switched
to a new code, where JSON parsing was replaced with jackson  and for a brief period ( I believe
Friday afternoon until now ) this suite was not executed.
> We already have a merge join bug DRILL-2010, but this one seems to be different (small
data set) and feels like a recent regression. 
> For QA: test Functional/Passing/aggregation/sanity/q18.sql will be running with merge
join disabled until this issue is fixed. Will need to remove alter session from this file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message