drill-issues mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (DRILL-1788) Conflicting column names in join
Date Sun, 14 Dec 2014 04:33:13 GMT

     [ https://issues.apache.org/jira/browse/DRILL-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha reassigned DRILL-1788:
---------------------------------

    Assignee: Aman Sinha  (was: Jacques Nadeau)

> Conflicting column names in join
> --------------------------------
>
>                 Key: DRILL-1788
>                 URL: https://issues.apache.org/jira/browse/DRILL-1788
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Steven Phillips
>            Assignee: Aman Sinha
>             Fix For: 0.8.0
>
>         Attachments: 0001-DRILL-1788-Test-query-for-case-insensitive-join.-Fix.patch,
0001-Workaround-for-CALCITE-528-Convert-field-names-to-lo.patch
>
>
> Drill doesn't support multiple columns within a batch having the same name. When doing
a join where there are matching column names, the planner inserts a project to rename
one of the columns to avoid this conflict.
> However, it appears that there is some case-sensitive matching somewhere in the code
path, because there are some cases where this rewrite does not happen:
> For example, this query does get the column rename (see 01-03):
> {code}
> 0: jdbc:drill:> explain plan for select n3.n_name from (select n2.n_name from cp.`tpch/nation.parquet`
n1, cp.`tpch/nation.parquet` n2 where n1.n_name = n2.n_name) n3 join cp.`tpch/nation.parquet`
n4 on n3.n_name = n4.n_name;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(n_name=[$0])
> 01-02          HashJoin(condition=[=($0, $1)], joinType=[inner])
> 01-04            HashToRandomExchange(dist0=[[$0]])
> 02-01              Project(n_name=[$1])
> 02-02                HashJoin(condition=[=($0, $1)], joinType=[inner])
> 02-04                  HashToRandomExchange(dist0=[[$0]])
> 04-01                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
> 02-03                  Project(n_name0=[$0])
> 02-05                    HashToRandomExchange(dist0=[[$0]])
> 05-01                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
> 01-03            Project(n_name0=[$0])
> 01-05              HashToRandomExchange(dist0=[[$0]])
> 03-01                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]],
selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
> {code}
> But if I change one of the letters in one of the identifiers to uppercase, the rename
at 01-03 goes away:
> {code}
> 0: jdbc:drill:> explain plan for select n3.n_name from (select n2.n_name from cp.`tpch/nation.parquet`
n1, cp.`tpch/nation.parquet` n2 where n1.N_name = n2.n_name) n3 join cp.`tpch/nation.parquet`
n4 on n3.n_name = n4.n_name;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(n_name=[$0])
> 01-02          HashJoin(condition=[=($0, $1)], joinType=[inner])
> 01-04            HashToRandomExchange(dist0=[[$0]])
> 02-01              Project(n_name=[$1])
> 02-02                HashJoin(condition=[=($0, $1)], joinType=[inner])
> 02-04                  HashToRandomExchange(dist0=[[$0]])
> 04-01                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`N_name`]]])
> 02-03                  Project(N_name0=[$0])
> 02-05                    HashToRandomExchange(dist0=[[$0]])
> 05-01                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`N_name`]]])
> 01-03            HashToRandomExchange(dist0=[[$0]])
> 03-01              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]],
selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`N_name`]]])
> {code}
> Running this query without the rewrite results in failure:
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> 	at java.util.ArrayList.rangeCheck(ArrayList.java:604) ~[na:1.7.0_21]
> 	at java.util.ArrayList.get(ArrayList.java:382) ~[na:1.7.0_21]
> 	at org.apache.drill.exec.record.VectorContainer.getValueAccessorById(VectorContainer.java:252)
~[drill-java-exec-0.7.0-incubating-SNAPSHOT-rebuffed.jar:0.7.0-incubating-SNAPSHOT]
> 	at org.apache.drill.exec.record.AbstractRecordBatch.getValueAccessorById(AbstractRecordBatch.java:153)
~[drill-java-exec-0.7.0-incubating-SNAPSHOT-rebuffed.jar:0.7.0-incubating-SNAPSHOT]
> 	at org.apache.drill.exec.test.generated.HashJoinProbeGen249.doSetup(HashJoinProbeTemplate.java:46)
~[na:na]
> 	at org.apache.drill.exec.test.generated.HashJoinProbeGen249.setupHashJoinProbe(HashJoinProbeTemplate.java:97)
~[na:na]
> 	at org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext(HashJoinBatch.java:226)
~[drill-java-exec-0.7.0-incubating-SNAPSHOT-rebuffed.jar:0.7.0-incubating-SNAPSHOT]
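
The rename step described above can be illustrated with a minimal sketch. This is not Drill's or Calcite's planner code: the class JoinColumnRename and the method renameRight are hypothetical names invented for illustration, and the numeric-suffix scheme is only assumed from the n_name0 / N_name0 names visible in the plans. It shows how a case-sensitive collision check can miss the n_name / N_name conflict, while a check on lower-cased names catches it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class JoinColumnRename {

    // Hypothetical helper (not Drill's planner code): returns the right-side
    // column names, renaming any that collide with a left-side name.
    static List<String> renameRight(List<String> left, List<String> right,
                                    boolean caseSensitive) {
        Set<String> used = new HashSet<>();
        for (String name : left) {
            used.add(key(name, caseSensitive));
        }
        List<String> result = new ArrayList<>();
        for (String name : right) {
            String candidate = name;
            int suffix = 0;
            // Append 0, 1, 2, ... until the candidate no longer collides.
            while (!used.add(key(candidate, caseSensitive))) {
                candidate = name + suffix++;
            }
            result.add(candidate);
        }
        return result;
    }

    // Comparison key: the name itself, or its lower-cased form when the
    // check is meant to be case-insensitive.
    private static String key(String name, boolean caseSensitive) {
        return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("n_name");
        List<String> right = Arrays.asList("N_name");
        // Case-sensitive check sees no conflict, so no rename happens.
        System.out.println(renameRight(left, right, true));   // prints [N_name]
        // Case-insensitive check detects the conflict and renames.
        System.out.println(renameRight(left, right, false));  // prints [N_name0]
    }
}
```

If the execution layer matches column names without regard to case while the rename check is case-sensitive, the first result would leave two columns that the batch treats as the same name, which is the kind of conflict the description reports.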



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
