drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2735) Broadcast plan get's "lost" when the same query is used in UNION ALL
Date Thu, 02 Jul 2015 22:00:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612596#comment-14612596
] 

Jinfeng Ni commented on DRILL-2735:
-----------------------------------

I do not have the j1 data. But I tried with a different dataset, and the plan with UNION all
does have broadcast-based join. Could you either please upload the sample data you use to
re-produce the issue, or re-try and update the status?

For now, I'm going to mark it as "fixed".

{code}
alter session set `planner.slice_target` = 1;
{code}

{code}
explain plan for select j1.c_integer from j1, j2 where j1.c_integer = j2.c_integer;

00-00    Screen
00-01      UnionExchange
01-01        Project(c_integer=[$0])
01-02          HashJoin(condition=[=($0, $1)], joinType=[inner])
01-03            Project(c_integer0=[$0])
01-05              BroadcastExchange
02-01                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/j2]],
selectionRoot=file:/tmp/j2, numFiles=1, columns=[`c_integer`]]])
01-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/j1]],
selectionRoot=file:/tmp/j1, numFiles=1, columns=[`c_integer`]]])
{code}

Query with UNION ALL : Broadcast-join is used in the plan.

{code}
explain plan for select j1.c_integer from j1, j2 where j1.c_integer = j2.c_integer union all
select j1.c_integer from j1, j2 where j1.c_integer = j2.c_integer;

00-00    Screen
00-01      UnionAll(all=[true])
00-03        Project(c_integer=[$0])
00-05          HashJoin(condition=[=($0, $1)], joinType=[inner])
00-08            Project(c_integer0=[$0])
00-11              BroadcastExchange
01-01                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/j2]],
selectionRoot=file:/tmp/j2, numFiles=1, columns=[`c_integer`]]])
00-09            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/j1]],
selectionRoot=file:/tmp/j1, numFiles=1, columns=[`c_integer`]]])
00-02        Project(c_integer=[$0])
00-04          HashJoin(condition=[=($0, $1)], joinType=[inner])
00-06            Project(c_integer0=[$0])
00-10              BroadcastExchange
02-01                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/j2]],
selectionRoot=file:/tmp/j2, numFiles=1, columns=[`c_integer`]]])
00-07            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/j1]],
selectionRoot=file:/tmp/j1, numFiles=1, columns=[`c_integer`]]])
{code}




> Broadcast plan get's "lost" when the same query is used in UNION ALL
> --------------------------------------------------------------------
>
>                 Key: DRILL-2735
>                 URL: https://issues.apache.org/jira/browse/DRILL-2735
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.9.0
>            Reporter: Victoria Markman
>            Assignee: Jinfeng Ni
>             Fix For: 1.2.0
>
>
> I get a broadcast plan for simple inner join query.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select j1.c_integer from j1, j2 where j1.c_integer
= j2.c_integer;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(c_integer=[$0])
> 01-02          HashJoin(condition=[=($0, $1)], joinType=[inner])
> 01-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/ctas/j1]],
selectionRoot=/drill/testdata/ctas/j1, numFiles=1, columns=[`c_integer`]]])
> 01-03            Project(c_integer0=[$0])
> 01-05              BroadcastExchange
> 02-01                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/ctas/j2]],
selectionRoot=/drill/testdata/ctas/j2, numFiles=1, columns=[`c_integer`]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ {
>       "name" : "planner.broadcast_factor",
>       "kind" : "DOUBLE",
>       "type" : "SESSION",
>       "float_val" : 0.0
>     }, {
>       "name" : "planner.slice_target",
>       "kind" : "LONG",
>       "type" : "SESSION",
>       "num_val" : 1
>     } ],
> {code}
> Create table succeeds and multiple fragments are executed:
> {code}
> 0: jdbc:drill:schema=dfs> create table test(a1) as  select j1.c_integer from j1, j2
where j1.c_integer = j2.c_integer;
> +------------+---------------------------+
> |  Fragment  | Number of records written |
> +------------+---------------------------+
> | 1_1        | 0                         |
> | 1_3        | 0                         |
> | 1_31       | 0                         |
> | 1_43       | 0                         |
> | 1_35       | 0                         |
> | 1_21       | 0                         |
> | 1_19       | 0                         |
> | 1_27       | 1                         |
> | 1_17       | 1                         |
> | 1_13       | 0                         |
> | 1_29       | 0                         |
> | 1_33       | 0                         |
> | 1_25       | 0                         |
> | 1_7        | 0                         |
> | 1_11       | 0                         |
> | 1_37       | 0                         |
> | 1_45       | 0                         |
> | 1_9        | 0                         |
> | 1_23       | 1                         |
> | 1_15       | 0                         |
> | 1_41       | 0                         |
> | 1_39       | 0                         |
> | 1_5        | 0                         |
> | 1_10       | 0                         |
> | 1_14       | 0                         |
> | 1_24       | 0                         |
> | 1_16       | 0                         |
> | 1_12       | 0                         |
> | 1_36       | 0                         |
> | 1_20       | 0                         |
> | 1_34       | 1                         |
> | 1_40       | 0                         |
> | 1_22       | 0                         |
> | 1_26       | 0                         |
> | 1_32       | 1                         |
> | 1_8        | 0                         |
> | 1_18       | 0                         |
> | 1_42       | 0                         |
> | 1_44       | 0                         |
> | 1_38       | 0                         |
> | 1_30       | 0                         |
> | 1_28       | 1                         |
> | 1_4        | 10                        |
> | 1_2        | 1                         |
> | 1_6        | 0                         |
> | 1_0        | 0                         |
> +------------+---------------------------+
> 46 rows selected (2.337 seconds)
> {code}
> 8 parquet files are written:
> {code}
> [Wed Apr 08 11:41:10 root@/mapr/vmarkman.cluster.com/drill/testdata/ctas/test ] # ls
-ltr
> total 4
> -rwxr-xr-x 1 mapr mapr 146 Apr  8 11:40 1_17_0.parquet
> -rwxr-xr-x 1 mapr mapr 146 Apr  8 11:40 1_27_0.parquet
> -rwxr-xr-x 1 mapr mapr 146 Apr  8 11:40 1_23_0.parquet
> -rwxr-xr-x 1 mapr mapr 146 Apr  8 11:40 1_34_0.parquet
> -rwxr-xr-x 1 mapr mapr 185 Apr  8 11:40 1_4_0.parquet
> -rwxr-xr-x 1 mapr mapr 146 Apr  8 11:40 1_32_0.parquet
> -rwxr-xr-x 1 mapr mapr 146 Apr  8 11:40 1_28_0.parquet
> -rwxr-xr-x 1 mapr mapr 146 Apr  8 11:40 1_2_0.parquet
> {code}
> However:
> 1. broadcast join is not planned when the same join query used in legs of the union all
oprartor:
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select j1.c_integer from j1, j2 where j1.c_integer
= j2.c_integer union all select j1.c_integer from j1, j2 where j1.c_integer = j2.c_integer;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      UnionAll(all=[true])
> 00-03        Project(c_integer=[$0])
> 00-05          HashJoin(condition=[=($0, $1)], joinType=[inner])
> 00-09            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/ctas/j1]],
selectionRoot=/drill/testdata/ctas/j1, numFiles=1, columns=[`c_integer`]]])
> 00-08            Project(c_integer0=[$0])
> 00-11              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/ctas/j2]],
selectionRoot=/drill/testdata/ctas/j2, numFiles=1, columns=[`c_integer`]]])
> 00-02        Project(c_integer=[$0])
> 00-04          HashJoin(condition=[=($0, $1)], joinType=[inner])
> 00-07            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/ctas/j1]],
selectionRoot=/drill/testdata/ctas/j1, numFiles=1, columns=[`c_integer`]]])
> 00-06            Project(c_integer0=[$0])
> 00-10              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/ctas/j2]],
selectionRoot=/drill/testdata/ctas/j2, numFiles=1, columns=[`c_integer`]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ {
>       "name" : "planner.broadcast_factor",
>       "kind" : "DOUBLE",
>       "type" : "SESSION",
>       "float_val" : 0.0
>     }, {
>       "name" : "planner.slice_target",
>       "kind" : "LONG",
>       "type" : "SESSION",
>       "num_val" : 1
>     } ],
> {code}
> 2. CTAS with this query produces 1 parquet file:
> {code}
> [Thu Apr 09 09:13:26 root@/mapr/vmarkman.cluster.com/drill/testdata/ctas/yyy ] # ls
> 0_0_0.parquet
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message