drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (DRILL-1987) join with tons of duplicates hangs with hash join
Date Wed, 04 Feb 2015 22:20:35 GMT

     [ https://issues.apache.org/jira/browse/DRILL-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jacques Nadeau resolved DRILL-1987.
-----------------------------------
    Resolution: Not a Problem

The query generates a large intermediate result but appears to complete.

> join with tons of duplicates hangs with hash join
> -------------------------------------------------
>
>                 Key: DRILL-1987
>                 URL: https://issues.apache.org/jira/browse/DRILL-1987
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 0.8.0
>            Reporter: Chun Chang
>            Assignee: Aman Sinha
>
> #Fri Jan 09 20:39:31 EST 2015
> git.commit.id.abbrev=487d98e
> With hash join enabled (default), the following join query hangs (running for about 30
min now). The join condition has mostly duplicates. Each table has 1 million rows. Data can
be downloaded here:
> https://s3.amazonaws.com/apache-drill/files/complex.json.gz
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin`
= false;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.025 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin`
= true;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.045 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select a.id, b.gbyi, a.str from `complex.json`
a inner join `complex.json` b on a.gbyi=b.gbyi order by a.id limit 20;
> +------------+------------+------------+
> |     id     |    gbyi    |    str     |
> +------------+------------+------------+
> {code}
> physical plan:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select a.id, b.gbyi,
a.str from `complex.json` a inner join `complex.json` b on a.gbyi=b.gbyi order by a.id limit
20;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(id=[$0], gbyi=[$1], str=[$2])
> 00-02        SelectionVectorRemover
> 00-03          Limit(fetch=[20])
> 00-04            SingleMergeExchange(sort0=[0 ASC])
> 01-01              SelectionVectorRemover
> 01-02                TopN(limit=[20])
> 01-03                  HashToRandomExchange(dist0=[[$0]])
> 02-01                    Project(id=[$1], gbyi=[$3], str=[$2])
> 02-02                      HashJoin(condition=[=($0, $3)], joinType=[inner])
> 02-04                        HashToRandomExchange(dist0=[[$0]])
> 03-01                          Project(gbyi=[$0], id=[$2], str=[$1])
> 03-02                            Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json,
numFiles=1, columns=[`gbyi`, `id`, `str`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> 02-03                        Project(gbyi0=[$0])
> 02-05                          HashToRandomExchange(dist0=[[$0]])
> 04-01                            Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json,
numFiles=1, columns=[`gbyi`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> {code}
> If I turn merge join on, the query finishes rather quickly, like within a minute.
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin`
= false;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.026 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin`
= true;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select a.id, b.gbyi,
a.str from `complex.json` a inner join `complex.json` b on a.gbyi=b.gbyi order by a.id limit
20;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(id=[$0], gbyi=[$1], str=[$2])
> 00-02        SelectionVectorRemover
> 00-03          Limit(fetch=[20])
> 00-04            SingleMergeExchange(sort0=[0 ASC])
> 01-01              SelectionVectorRemover
> 01-02                TopN(limit=[20])
> 01-03                  HashToRandomExchange(dist0=[[$0]])
> 02-01                    Project(id=[$1], gbyi=[$3], str=[$2])
> 02-02                      MergeJoin(condition=[=($0, $3)], joinType=[inner])
> 02-04                        SelectionVectorRemover
> 02-06                          Sort(sort0=[$0], dir0=[ASC])
> 02-08                            HashToRandomExchange(dist0=[[$0]])
> 03-01                              Project(gbyi=[$0], id=[$2], str=[$1])
> 03-02                                Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json,
numFiles=1, columns=[`gbyi`, `id`, `str`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> 02-03                        Project(gbyi0=[$0])
> 02-05                          SelectionVectorRemover
> 02-07                            Sort(sort0=[$0], dir0=[ASC])
> 02-09                              HashToRandomExchange(dist0=[[$0]])
> 04-01                                Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json,
numFiles=1, columns=[`gbyi`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message