Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Thu, 16 Apr 2015 17:51:59 +0000 (UTC)
From: "Victoria Markman (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.12821005.1429052556000.20885.1429206719360@Atlassian.JIRA>
In-Reply-To: <JIRA.12821005.1429052556000@Atlassian.JIRA>
References: <JIRA.12821005.1429052556000@Atlassian.JIRA>
 <JIRA.12821005.1429052556562@arcas>
Subject: [jira] [Commented] (DRILL-2794) Partition pruning is not happening
 correctly (results in a full table scan) when maxdir/mindir is used in the
 filter condition
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/DRILL-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498378#comment-14498378 ]

Victoria Markman commented on DRILL-2794:
-----------------------------------------

After further investigation, it turned out that my files were very small (below 1K) and the scan was costed as 0.
Once I increased the size of one of the files to more than 1K, I got the expected behavior: a scan of just one file.
This issue is a duplicate of https://issues.apache.org/jira/browse/DRILL-2553

> Partition pruning is not happening correctly (results in a full table scan) when maxdir/mindir is used in the filter condition
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2794
>                 URL: https://issues.apache.org/jira/browse/DRILL-2794
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.9.0
>            Reporter: Victoria Markman
>            Assignee: Victoria Markman
>
> Directory structure:
> {code}
> [Tue Apr 14 13:43:54 root@/mapr/vmarkman.cluster.com/test/smalltable ] # ls -R
> .:
> 2014  2015  2016
> ./2014:
> ./2015:
> 01  02
> ./2015/01:
> t1.csv
> ./2015/02:
> t2.csv
> ./2016:
> t1.csv
> [Tue Apr 14 13:44:26 root@/mapr/vmarkman.cluster.com/test/bigtable ] # ls -R
> .:
> 2015  2016
> ./2015:
> 01  02  03  04
> ./2015/01:
> 0_0_0.parquet  1_0_0.parquet  2_0_0.parquet  3_0_0.parquet  4_0_0.parquet  5_0_0.parquet
> ./2015/02:
> 0_0_0.parquet
> ./2015/03:
> 0_0_0.parquet
> ./2015/04:
> 0_0_0.parquet
> ./2016:
> 01  parquet.file
> ./2016/01:
> 0_0_0.parquet
> {code}
> In the simple case, partition pruning happens correctly: only the 2016 directory of 'smalltable' is scanned.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 = maxdir('dfs.test', 'bigtable');
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(*=3D[$0])
> 00-02        Project(*=3D[$0])
> 00-03          Scan(groupscan=3D[EasyGroupScan [selectionRoot=3D/test/sma=
lltable, numFiles=3D1, columns=3D[`*`], files=3D[maprfs:/test/smalltable/20=
16/t1.csv]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 3,
>     "files" : [ "maprfs:/test/smalltable/2016/t1.csv" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "maprfs:///",
>       "workspaces" : {
>         "root" : {
>           "location" : "/",
>           "writable" : false,
>           "defaultInputFormat" : null
>         },
> ...
> ...
> {code}
> With a second predicate added (dir1 = mindir('dfs.test', 'bigtable/2016')), which evaluates to false (there is no directory '01' under 2016 in smalltable),
> we end up scanning everything in smalltable. This does not look right to me, and I think this is a bug.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 = maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(*=3D[$0])
> 00-02        Project(T15=C2=A6=C2=A6*=3D[$0])
> 00-03          SelectionVectorRemover
> 00-04            Filter(condition=3D[AND(=3D($1, '2016'), =3D($2, '01'))]=
)
> 00-05              Project(T15=C2=A6=C2=A6*=3D[$0], dir0=3D[$1], dir1=3D[=
$2])
> 00-06                Scan(groupscan=3D[EasyGroupScan [selectionRoot=3D/te=
st/smalltable, numFiles=3D3, columns=3D[`*`], files=3D[maprfs:/test/smallta=
ble/2015/01/t1.csv, maprfs:/test/smalltable/2015/02/t2.csv, maprfs:/test/sm=
alltable/2016/t1.csv]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 6,
>     "files" : [ "maprfs:/test/smalltable/2015/01/t1.csv", "maprfs:/test/smalltable/2015/02/t2.csv", "maprfs:/test/smalltable/2016/t1.csv" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "maprfs:///",
>       "workspaces" : {
>         "root" : {
>           "location" : "/",
>           "writable" : false,
>           "defaultInputFormat" : null
>         },
> ...
> ...
> {code}
> Here is a similar example with a parquet file, where the predicate "a1=11" evaluates to false.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where dir0=maxdir('dfs.test','bigtable') and a1 = 11;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(*=3D[$0])
> 00-02        Project(T25=C2=A6=C2=A6*=3D[$0])
> 00-03          SelectionVectorRemover
> 00-04            Filter(condition=3D[AND(=3D($1, '2016'), =3D($2, 11))])
> 00-05              Project(T25=C2=A6=C2=A6*=3D[$0], dir0=3D[$1], a1=3D[$2=
])
> 00-06                Scan(groupscan=3D[ParquetGroupScan [entries=3D[ReadE=
ntryWithPath [path=3Dmaprfs:/test/bigtable/2016/01/0_0_0.parquet], ReadEntr=
yWithPath [path=3Dmaprfs:/test/bigtable/2016/parquet.file]], selectionRoot=
=3D/test/bigtable, numFiles=3D2, columns=3D[`*`]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "parquet-scan",
>     "@id" : 6,
>     "entries" : [ {
>       "path" : "maprfs:/test/bigtable/2016/01/0_0_0.parquet"
>     }, {
>       "path" : "maprfs:/test/bigtable/2016/parquet.file"
>     } ],
> {code}
> And finally, when we use the same table in the from clause and in maxdir/mindir, we scan only one file (to return schema):
> I would think that the same should happen in the bug case above ...
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where dir0 = maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(*=3D[$0])
> 00-02        Project(T29=C2=A6=C2=A6*=3D[$0])
> 00-03          SelectionVectorRemover
> 00-04            Filter(condition=3D[AND(=3D($1, '2016'), =3D($2, 'parque=
t.file'))])
> 00-05              Project(T29=C2=A6=C2=A6*=3D[$0], dir0=3D[$1], dir1=3D[=
$2])
> 00-06                Scan(groupscan=3D[ParquetGroupScan [entries=3D[ReadE=
ntryWithPath [path=3Dmaprfs:/test/bigtable/2015/01/4_0_0.parquet]], selecti=
onRoot=3D/test/bigtable, numFiles=3D1, columns=3D[`*`]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "parquet-scan",
>     "@id" : 6,
>     "entries" : [ {
>       "path" : "maprfs:/test/bigtable/2015/01/4_0_0.parquet"
>     } ],
> {code}
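
For reference, the pruning expected in the report above can be sketched as a small Python model. This is only an illustration, not Drill's planner code: the function name `prune_files` is hypothetical, and treating dirN as the N-th directory component below the selection root (with a missing level never matching) is an assumption made for the sketch. The file list is the 'smalltable' listing from the report, without the maprfs: scheme.

```python
def prune_files(files, selection_root, dir_filters):
    """Return the files whose directory components match every dirN filter.

    files          -- full paths, e.g. "/test/smalltable/2015/01/t1.csv"
    selection_root -- table root, e.g. "/test/smalltable"
    dir_filters    -- {level: value}, e.g. {0: "2016"} for dir0 = '2016'
    """
    kept = []
    for path in files:
        # Directory components between the selection root and the file name.
        relative = path[len(selection_root):].strip("/")
        dirs = relative.split("/")[:-1]
        # A file with no directory at a filtered level cannot match (assumption).
        if all(level < len(dirs) and dirs[level] == value
               for level, value in dir_filters.items()):
            kept.append(path)
    return kept


smalltable = [
    "/test/smalltable/2015/01/t1.csv",
    "/test/smalltable/2015/02/t2.csv",
    "/test/smalltable/2016/t1.csv",
]

# dir0 = '2016': only the 2016 file survives, as in the first plan.
print(prune_files(smalltable, "/test/smalltable", {0: "2016"}))
# dir0 = '2016' AND dir1 = '01': no file in smalltable matches both.
print(prune_files(smalltable, "/test/smalltable", {0: "2016", 1: "01"}))
```

Under this model the two-predicate query leaves no matching file in 'smalltable', so a full scan of all three files would not be needed. The model says nothing about scan costing, which is what turned out to drive the behavior in this case.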


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)