drill-issues mailing list archives

From "Arina Ielchiieva (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-4735) Count(dir0) on parquet returns 0 result
Date Mon, 17 Jul 2017 13:17:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089662#comment-16089662 ]

Arina Ielchiieva edited comment on DRILL-4735 at 7/17/17 1:16 PM:
------------------------------------------------------------------

The fix will cover the following aspects:

1. {{ConvertCountToDirectScan}} will be able to distinguish between implicit / directory and non-existent columns; relates to the current Jira DRILL-4735.
To achieve this, the {{Agg_on_scan}} and {{Agg_on_proj_on_scan}} rules will take a new constructor parameter, {{OptimizerRulesContext}}, similar to the [prune scan rules|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/planner/PlannerPhase.java#L345].
This will help determine whether a column is an implicit / directory column. {{OptimizerRulesContext}} has access to session options through {{PlannerSettings}}, which are crucial for determining the current implicit / directory column names.
Once we receive the list of columns the rule would be applied to, we'll check whether it contains implicit or directory columns. If it does, we won't apply the rule.
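
For illustration only, a minimal sketch of that bail-out check (the class and method names below are hypothetical, and the implicit / directory column names would in practice come from {{PlannerSettings}} session options rather than being hard-coded):
{code:java}
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch: the rewrite to a direct scan is skipped as soon as
 * any aggregated column is an implicit or directory column.
 */
public class ImplicitColumnCheckSketch {

  // In Drill these names are configurable session options; the defaults are
  // hard-coded here only to keep the sketch self-contained.
  private final Set<String> implicitOrDirectoryColumns;

  public ImplicitColumnCheckSketch(Set<String> implicitOrDirectoryColumns) {
    this.implicitOrDirectoryColumns = implicitOrDirectoryColumns;
  }

  /** Returns true only when no aggregated column is implicit / directory. */
  public boolean canApplyRule(List<String> aggregatedColumns) {
    for (String column : aggregatedColumns) {
      if (implicitOrDirectoryColumns.contains(column.toLowerCase())) {
        return false; // e.g. count(dir0) must not be answered from Parquet metadata alone
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> names = new LinkedHashSet<>(
        Arrays.asList("dir0", "dir1", "filename", "fqn", "filepath", "suffix"));
    ImplicitColumnCheckSketch check = new ImplicitColumnCheckSketch(names);
    System.out.println(check.canApplyRule(Arrays.asList("n_nationkey"))); // true  -> rewrite to direct scan
    System.out.println(check.canApplyRule(Arrays.asList("dir0")));        // false -> leave the plan as is
  }
}
{code}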

2. The {{ConvertCountToDirectScan}} rule will become applicable to two or more COUNT aggregates; relates to DRILL-1691.
When column statistics are available, we use {{PojoRecordReader}} to return the pre-computed values. {{PojoRecordReader}} requires an exact model. In our case we need a reader that allows a dynamic model (one where the number of columns is not known in advance). For this purpose {{DynamicPojoRecordReader}} will be used. Instead of an exact model it will accept a dynamic model represented by {{List<LinkedHashMap<String, Object>> records}}, a list of maps where the key is the column name and the value is the column value.
Common logic between {{PojoRecordReader}} and {{DynamicPojoRecordReader}} will be extracted into an abstract parent class.
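
A minimal sketch of what that dynamic model looks like, using plain Java collections only (the count values and column labels simply mirror the example further below and are not tied to any real reader API):
{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

/**
 * Hypothetical sketch of the dynamic model a DynamicPojoRecordReader-style
 * reader would consume: a single record with one value per COUNT aggregate.
 */
public class DynamicModelSketch {

  public static void main(String[] args) {
    // One map per record; LinkedHashMap keeps the column order stable.
    LinkedHashMap<String, Object> record = new LinkedHashMap<>();
    record.put("count0", 25L);  // e.g. count(*) taken from row-group metadata
    record.put("count1", 25L);  // e.g. count(n_nationkey) from column statistics
    record.put("count2", 25L);  // e.g. count of another column, also pre-computed

    List<LinkedHashMap<String, Object>> records = new ArrayList<>();
    records.add(record);

    // The reader would emit this single row directly, so no Parquet data
    // pages have to be read at execution time.
    System.out.println(records); // [{count0=25, count1=25, count2=25}]
  }
}
{code}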

3. Currently, when {{ConvertCountToDirectScan}} is applied, the plan shows the following:
{noformat}
Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@60017daf[columns = null, isStarQuery = false, isSkipQuery = false]])
{noformat}
The user has no idea that column statistics were used to calculate the result, whether partition pruning took place, etc.; relates to DRILL-5357 and DRILL-3407.
Currently we use {{DirectGroupScan}} to hold our record reader. To include more information we'll extend {{DirectGroupScan}} into {{MetadataDirectGroupScan}}, which will contain information about the files read, if any. Also, the {{toString}} methods of {{PojoRecordReader}} and {{DynamicPojoRecordReader}} will be overridden to show meaningful information to the user.
Example:
{noformat}
Scan(groupscan=[usedMetadata = true, files = [/tpch/nation.parquet], DynamicPojoRecordReader{records=[{count0=25, count1=25, count2=25}]}])
{noformat}
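
As a rough sketch, assuming hypothetical class and field names rather than the final implementation, an overridden {{toString}} on such a metadata-aware group scan could assemble that plan fragment like this:
{code:java}
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;

/**
 * Hypothetical sketch of a metadata-aware group scan whose toString exposes
 * what the planner actually did, instead of a bare object hash.
 */
public class MetadataDirectGroupScanSketch {

  private final boolean usedMetadata;
  private final List<String> files;
  private final List<LinkedHashMap<String, Object>> records;

  public MetadataDirectGroupScanSketch(boolean usedMetadata, List<String> files,
                                       List<LinkedHashMap<String, Object>> records) {
    this.usedMetadata = usedMetadata;
    this.files = files;
    this.records = records;
  }

  @Override
  public String toString() {
    // Mirrors the example plan fragment shown above.
    return "usedMetadata = " + usedMetadata
        + ", files = " + files
        + ", DynamicPojoRecordReader{records=" + records + "}";
  }

  public static void main(String[] args) {
    LinkedHashMap<String, Object> record = new LinkedHashMap<>();
    record.put("count0", 25L);
    record.put("count1", 25L);
    record.put("count2", 25L);
    System.out.println(new MetadataDirectGroupScanSketch(
        true, Arrays.asList("/tpch/nation.parquet"), Arrays.asList(record)));
    // usedMetadata = true, files = [/tpch/nation.parquet], DynamicPojoRecordReader{records=[{count0=25, count1=25, count2=25}]}
  }
}
{code}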




> Count(dir0) on parquet returns 0 result
> ---------------------------------------
>
>                 Key: DRILL-4735
>                 URL: https://issues.apache.org/jira/browse/DRILL-4735
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization, Storage - Parquet
>    Affects Versions: 1.0.0, 1.4.0, 1.6.0, 1.7.0
>            Reporter: Krystal
>            Assignee: Arina Ielchiieva
>            Priority: Critical
>
> Selecting a count of dir0, dir1, etc against a parquet directory returns 0 rows.
> select count(dir0) from `min_max_dir`;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> select count(dir1) from `min_max_dir`;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> If I put both dir0 and dir1 in the same select, it returns expected result:
> select count(dir0), count(dir1) from `min_max_dir`;
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | 600     | 600     |
> +---------+---------+
> Here is the physical plan for count(dir0) query:
> {code}
> 00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {22.0 rows, 22.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1346
> 00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1345
> 00-02        Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1344
> 00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@3da85d3b[columns = null, isStarQuery = false, isSkipQuery = false]]) : rowType = RecordType(BIGINT count): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1343
> {code}
> Here is part of the explain plan for the count(dir0) and count(dir1) in the same select:
> {code}
> 00-00    Screen : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1206.0 rows, 15606.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1623
> 00-01      Project(EXPR$0=[$0], EXPR$1=[$1]) : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1622
> 00-02        StreamAgg(group=[{}], EXPR$0=[COUNT($0)], EXPR$1=[COUNT($1)]) : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1621
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1999/Apr/voter20.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1999/MAR/voter15.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1985/jan/voter5.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1985/apr/voter60.parquet/0_0_0.parquet],..., ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/2014/jul/voter35.parquet/0_0_0.parquet]], selectionRoot=maprfs:/drill/testdata/min_max_dir, numFiles=16, usedMetadataFile=false, columns=[`dir0`, `dir1`]]]) : rowType = RecordType(ANY dir0, ANY dir1): rowcount = 600.0, cumulative cost = {600.0 rows, 1200.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1620
> {code}
> Notice that in the first case, "org.apache.drill.exec.store.pojo.PojoRecordReader" is used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
