Date: Thu, 16 Apr 2015 17:51:59 +0000 (UTC)
From: "Victoria Markman (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Commented] (DRILL-2794) Partition pruning is not happening correctly (results in a full table scan) when maxdir/mindir is used in the filter condition


    [ https://issues.apache.org/jira/browse/DRILL-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498378#comment-14498378 ]

Victoria Markman commented on DRILL-2794:
-----------------------------------------

After further investigation, it turned out that my files were very small (below 1K) and the scan was costed as 0.
As soon as I increased the size of one of the files to more than 1K, I got the expected behavior: a scan of just one file.
This issue is a duplicate of https://issues.apache.org/jira/browse/DRILL-2553

> Partition pruning is not happening correctly (results in a full table scan) when maxdir/mindir is used in the filter condition
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2794
>                 URL: https://issues.apache.org/jira/browse/DRILL-2794
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.9.0
>            Reporter: Victoria Markman
>            Assignee: Victoria Markman
>
> Directory structure:
> {code}
> [Tue Apr 14 13:43:54 root@/mapr/vmarkman.cluster.com/test/smalltable ] # ls -R
> .:
> 2014  2015  2016
> ./2014:
> ./2015:
> 01  02
> ./2015/01:
> t1.csv
> ./2015/02:
> t2.csv
> ./2016:
> t1.csv
> [Tue Apr 14 13:44:26 root@/mapr/vmarkman.cluster.com/test/bigtable ] # ls -R
> .:
> 2015  2016
> ./2015:
> 01  02  03  04
> ./2015/01:
> 0_0_0.parquet  1_0_0.parquet  2_0_0.parquet  3_0_0.parquet  4_0_0.parquet  5_0_0.parquet
> ./2015/02:
> 0_0_0.parquet
> ./2015/03:
> 0_0_0.parquet
> ./2015/04:
> 0_0_0.parquet
> ./2016:
> 01  parquet.file
> ./2016/01:
> 0_0_0.parquet
> {code}
> Simple case, partition pruning is happening correctly: only the 2016 directory is scanned from 'smalltable'.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 = maxdir('dfs.test', 'bigtable');
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(*=[$0])
> 00-02        Project(*=[$0])
> 00-03          Scan(groupscan=[EasyGroupScan [selectionRoot=/test/smalltable, numFiles=1, columns=[`*`], files=[maprfs:/test/smalltable/2016/t1.csv]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 3,
>     "files" : [ "maprfs:/test/smalltable/2016/t1.csv" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "maprfs:///",
>       "workspaces" : {
>         "root" : {
>           "location" : "/",
>           "writable" : false,
>           "defaultInputFormat" : null
>         },
> ...
> ...
> {code}
> With an added second predicate (dir1 = mindir('dfs.test', 'bigtable/2016')) that evaluates to false (there is no directory '2016/01' in smalltable),
> we end up scanning everything in smalltable. This does not look right to me and I think this is a bug.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from smalltable where dir0 = maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(*=[$0])
> 00-02        Project(T15¦¦*=[$0])
> 00-03          SelectionVectorRemover
> 00-04            Filter(condition=[AND(=($1, '2016'), =($2, '01'))])
> 00-05              Project(T15¦¦*=[$0], dir0=[$1], dir1=[$2])
> 00-06                Scan(groupscan=[EasyGroupScan [selectionRoot=/test/smalltable, numFiles=3, columns=[`*`], files=[maprfs:/test/smalltable/2015/01/t1.csv, maprfs:/test/smalltable/2015/02/t2.csv, maprfs:/test/smalltable/2016/t1.csv]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "@id" : 6,
>     "files" : [ "maprfs:/test/smalltable/2015/01/t1.csv", "maprfs:/test/smalltable/2015/02/t2.csv", "maprfs:/test/smalltable/2016/t1.csv" ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "maprfs:///",
>       "workspaces" : {
>         "root" : {
>           "location" : "/",
>           "writable" : false,
>           "defaultInputFormat" : null
>         },
> ...
> ...
> {code}
> Here is a similar example with a parquet file, where the predicate "a1=11" evaluates to false.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where dir0=maxdir('dfs.test','bigtable') and a1 = 11;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(*=[$0])
> 00-02        Project(T25¦¦*=[$0])
> 00-03          SelectionVectorRemover
> 00-04            Filter(condition=[AND(=($1, '2016'), =($2, 11))])
> 00-05              Project(T25¦¦*=[$0], dir0=[$1], a1=[$2])
> 00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/bigtable/2016/01/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/test/bigtable/2016/parquet.file]], selectionRoot=/test/bigtable, numFiles=2, columns=[`*`]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "parquet-scan",
>     "@id" : 6,
>     "entries" : [ {
>       "path" : "maprfs:/test/bigtable/2016/01/0_0_0.parquet"
>     }, {
>       "path" : "maprfs:/test/bigtable/2016/parquet.file"
>     } ],
> {code}
> And finally, when we use the same table in the from clause and in maxdir/mindir, we scan only one file (to return schema):
> I would think that the same should happen in the bug case above ...
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from bigtable where dir0 = maxdir('dfs.test', 'bigtable') and dir1 = mindir('dfs.test', 'bigtable/2016');
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(*=[$0])
> 00-02        Project(T29¦¦*=[$0])
> 00-03          SelectionVectorRemover
> 00-04            Filter(condition=[AND(=($1, '2016'), =($2, 'parquet.file'))])
> 00-05              Project(T29¦¦*=[$0], dir0=[$1], dir1=[$2])
> 00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/bigtable/2015/01/4_0_0.parquet]], selectionRoot=/test/bigtable, numFiles=1, columns=[`*`]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "parquet-scan",
>     "@id" : 6,
>     "entries" : [ {
>       "path" : "maprfs:/test/bigtable/2015/01/4_0_0.parquet"
>     } ],
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
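
For reference, the pruning result expected from the smalltable queries quoted above can be modeled outside Drill. The sketch below is only an illustration in Python, not Drill's implementation. The helpers children, maxdir, mindir and prune are hypothetical stand-ins, the file lists are transcribed from the `ls -R` output in the report, and two things are assumptions: that entries are compared as plain strings, and that a non-directory entry such as parquet.file is a candidate. The quoted plans show both '01' and 'parquet.file' for the same mindir call, so the real ordering rule may well differ.

```python
from pathlib import PurePosixPath

# File lists transcribed from the `ls -R` output in the report,
# relative to each table root.
SMALLTABLE = ["2015/01/t1.csv", "2015/02/t2.csv", "2016/t1.csv"]
BIGTABLE = (
    [f"2015/01/{i}_0_0.parquet" for i in range(6)]
    + ["2015/02/0_0_0.parquet", "2015/03/0_0_0.parquet",
       "2015/04/0_0_0.parquet", "2016/01/0_0_0.parquet",
       "2016/parquet.file"]
)


def children(files, subdir=""):
    """Names of the immediate entries found under `subdir`."""
    prefix = PurePosixPath(subdir).parts
    names = set()
    for f in files:
        parts = PurePosixPath(f).parts
        if parts[:len(prefix)] == prefix and len(parts) > len(prefix):
            names.add(parts[len(prefix)])
    return names


def maxdir(files, subdir=""):
    """Hypothetical stand-in for maxdir: largest entry name (string order)."""
    return max(children(files, subdir))


def mindir(files, subdir=""):
    """Hypothetical stand-in for mindir: smallest entry name (string order)."""
    return min(children(files, subdir))


def prune(files, wanted):
    """Keep files whose dirN value equals wanted[N] for every requested N.

    A file with fewer directory levels than N has no dirN value and
    therefore never matches.
    """
    kept = []
    for f in files:
        dirs = PurePosixPath(f).parts[:-1]  # directory levels only
        if all(n < len(dirs) and dirs[n] == v for n, v in wanted.items()):
            kept.append(f)
    return kept


if __name__ == "__main__":
    d0 = maxdir(BIGTABLE)            # '2016'
    d1 = mindir(BIGTABLE, "2016")    # '01' under the string-order assumption
    print(prune(SMALLTABLE, {0: d0}))          # ['2016/t1.csv']
    print(prune(SMALLTABLE, {0: d0, 1: d1}))   # []
```

Under these assumptions the single-predicate query keeps only 2016/t1.csv, matching the first plan, and the two-predicate query matches no file at all, which is why a scan of all three smalltable files looked wrong in the report.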