drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andries Engelbrecht (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-2157) Directory pruning on subdirectories only and data type conversions for directory filters
Date Wed, 04 Feb 2015 16:03:35 GMT

    [ https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305368#comment-14305368
] 

Andries Engelbrecht commented on DRILL-2157:
--------------------------------------------

In addition the ability to perform directory pruning with < and > expressions.

Also if the ability to support directory pruning in views where there is a predicate filter
i.e. the view below is unable to perform directory pruning while views without the predicate
filter are.

Create or replace view maprfs.views.`retweeted` as
select CAST(t.`id` as BIGINT) as `id`, 
CAST(t.retweeted_status.`id` as BIGINT) as `retweet_id`,
t.dir0 as dir_year,
t.dir1 as dir_month,
t.dir2 as dir_day,
t.dir3 as dir_hour,
CAST(t.retweeted_status.`created_at` as VARCHAR(40)) as `created_at`,
to_date ((concat (substring(t.retweeted_status.`created_at`, 5,6),substring(t.retweeted_status.`created_at`,
26,5))), 'MMM dd yyyy') as `date`,
to_timestamp ((concat (substring(t.retweeted_status.`created_at`, 5,6),substring(t.retweeted_status.`created_at`,
26,5),substring(t.retweeted_status.`created_at`, 11,9))), 'MMM dd yyyy HH:mm:ss') as `timestamp`,
CAST(t.retweeted_status.`text` as VARCHAR(140)) as `tweet`,
CAST(t.retweeted_status.`user`.`favorites_count` as INT) as `favorites_count`
from maprfs.twitter.`/nfl` t where t.retweeted_status.`user`.`name` is not null;


> Directory pruning on subdirectories only and data type conversions for directory filters
> ----------------------------------------------------------------------------------------
>
>                 Key: DRILL-2157
>                 URL: https://issues.apache.org/jira/browse/DRILL-2157
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: Future
>            Reporter: Andries Engelbrecht
>            Assignee: Jinfeng Ni
>            Priority: Minor
>
> Drill will scan all files and directories when using only a subdirectory as a predicate.
Additionally if the data type for the directory filter is not a string and is converted Drill
will also first scan all the subdirectories adn files before applying the filter.
> My current observation is that for a directory structure as listed below,
> the pruning only works if the full tree is provided. If only a lower level
> directory is supplied in the filter condition Drill only uses it as a
> filter.
> With directory structure as below
> /2015
>         /01
>                /10
>                /11
>                /12
>                /13
>                /14
> Query:
> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
> dir2='10'
> Produces the correct pruning and query plan
> 01-02            Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
> 28434
> 01-03              Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
> 0.0 io, 0.0 network, 0.0 memory}, id = 28433
> 01-04                Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
> However:
> select count(id) from `/foo` t where dir2='10'
> Produces full scan of all sub directories and only applies a filter
> condition after the fact. Notice the numFiles between the 2, even though it
> lists columns in the base scan
> 01-04                Filter(condition=[=($0, '10')]): rowcount =
> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
> 0.0 network, 0.0 memory}, id = 27470
> 01-05                  Project(dir2=[$1], id=[$0]): rowcount =
> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
> io, 0.0 network, 0.0 memory}, id = 27469
> 01-06                    Scan(groupscan=[EasyGroupScan
> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
> Also using the wrong data type for the filter produces a full scan
> select count(id) from `/foo` where dir_year=2015 and dir_month=01 and dir_day=14
> Produces
> 01-04                Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), =(CAST($2):ANY
NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 212034.63825, cumulative cost =
{1.88475234E8 rows, 1.005201264E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34910
> 01-05                  Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): rowcount =
6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 cpu, 0.0 io, 0.0 network,
0.0 memory}, id = 34909
> 01-06                    Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405,
columns=[`id`, `dir0`, `dir1`, `dir2`],



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message