drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2010) merge join returns wrong number of rows with large dataset
Date Thu, 15 Jan 2015 00:45:34 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277995#comment-14277995
] 

Aman Sinha commented on DRILL-2010:
-----------------------------------

Note that with merge join, there are 4 column values that produce incorrect matches below:
 (correct number of matches for all values is 100).  
{code}
select a.gbyi, count(*) cnt from dfs.`/Users/asinha/data/complex10k.json` a inner join dfs.`/Users/asinha/data/complex10k.json`
b on a.gbyi=b.gbyi group by a.gbyi order by cnt limit 20;

gbyi	cnt
820	12
410	40
852	60
902	92
6	100
7	100
3	100
10	100
5	100
11	100
12	100
13	100
14	100
15	100
18	100
9	100
17	100
19	100
20	100
16	100
Total rows returned : 20
{code}



> merge join returns wrong number of rows with large dataset
> ----------------------------------------------------------
>
>                 Key: DRILL-2010
>                 URL: https://issues.apache.org/jira/browse/DRILL-2010
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Operators
>    Affects Versions: 0.8.0
>            Reporter: Chun Chang
>            Assignee: Aman Sinha
>
> #Mon Jan 12 18:19:31 EST 2015
> git.commit.id.abbrev=5b012bf
> When data set is big enough (like larger than one batch size), merge join will not returns
the correct number of rows. Hash join returns the correct number of rows. Data can be downloaded
from:
> https://s3.amazonaws.com/apache-drill/files/complex100k.json.gz
> With this dataset, the following query should return 10,000,000. 
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin`
= true;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin`
= false;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from `complex100k.json`
a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> |   EXPR$0   |
> +------------+
> | 9046760    |
> +------------+
> 1 row selected (6.205 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin`
= false;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.026 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin`
= true;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from `complex100k.json`
a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> |   EXPR$0   |
> +------------+
> | 10000000   |
> +------------+
> 1 row selected (4.453 seconds)
> {code}
> With smaller dataset, both merge and hash join returns the same correct number.
> physical plan for merge join:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select count(a.id)
from `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
> 00-02        Project(id=[$1])
> 00-03          MergeJoin(condition=[=($0, $2)], joinType=[inner])
> 00-05            SelectionVectorRemover
> 00-07              Sort(sort0=[$0], dir0=[ASC])
> 00-09                Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex100k.json,
numFiles=1, columns=[`gbyi`, `id`], files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> 00-04            Project(gbyi0=[$0])
> 00-06              SelectionVectorRemover
> 00-08                Sort(sort0=[$0], dir0=[ASC])
> 00-10                  Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex100k.json,
numFiles=1, columns=[`gbyi`], files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message