drill-issues mailing list archives

From "Arina Ielchiieva (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-4735) Count(dir0) on parquet returns 0 result
Date Fri, 21 Jul 2017 13:16:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096238#comment-16096238 ]

Arina Ielchiieva edited comment on DRILL-4735 at 7/21/17 1:15 PM:
------------------------------------------------------------------

During implementation some details have changed:
1. It turned out that we can access session options directly in {{ConvertCountToDirectScan}}
using {{PrelUtil.getPlannerSettings(call.getPlanner())}}, so there is no longer any need to pass
{{OptimizerRulesContext}} to {{ConvertCountToDirectScan}}. We will skip applying this rule if a
directory column is present in the selection. For implicit columns, by contrast, we'll set the
count result to the total record count, since they are derived from the files and there is no
data without a file. {{ConvertCountToDirectScan}} was also refactored: the count collection
logic is now encapsulated in the {{CountsCollector}} helper class.

2. We still introduced the {{DynamicPojoRecordReader}} class, but it accepts two parameters:
first, the schema, represented by {{LinkedHashMap<String, Class<?>>}}, and second, the records
themselves, represented by {{List<List<T>>}}. We force the user to pass the schema to cover
the case when there are no records to read but we still need a schema to proceed. If all records
are of the same type, the user may set {{T}} to that type; if the records contain different
types, {{T}} should be set to {{Object}}.

3. The {{MetadataDirectGroupScan}} string representation now also includes the number of files:
{noformat}
[usedMetadata = true, files = [/tpch/nation.parquet], numFiles = 1]
{noformat}
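
The decision described in item 1 can be sketched roughly as follows. This is a minimal, hypothetical illustration and not the actual {{ConvertCountToDirectScan}} code: the class {{CountRuleSketch}}, the {{Decision}} enum, and the {{decide}} method are invented for this example. It assumes directory columns use the default {{dir}} label followed by an index and that the implicit columns have their default names ({{fqn}}, {{filepath}}, {{filename}}, {{suffix}}). The third outcome, for regular columns, is also an assumption about what happens when neither special case applies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch of the per-column decision; not Drill's actual rule code. */
final class CountRuleSketch {

  /** What to do with COUNT(column) when trying to answer it from metadata. */
  enum Decision { SKIP_RULE, USE_TOTAL_RECORD_COUNT, USE_METADATA_COLUMN_COUNT }

  // Assumption: directory columns are named with the default "dir" label plus an index.
  private static final Pattern DIRECTORY_COLUMN =
      Pattern.compile("dir\\d+", Pattern.CASE_INSENSITIVE);

  // Assumption: implicit file columns carry their default names.
  private static final Set<String> IMPLICIT_COLUMNS =
      new HashSet<>(Arrays.asList("fqn", "filepath", "filename", "suffix"));

  static Decision decide(String column) {
    // Directory column in the selection: do not apply the rule at all.
    if (DIRECTORY_COLUMN.matcher(column).matches()) {
      return Decision.SKIP_RULE;
    }
    // Implicit columns are based on the files, so every record has a value.
    if (IMPLICIT_COLUMNS.contains(column.toLowerCase(Locale.ROOT))) {
      return Decision.USE_TOTAL_RECORD_COUNT;
    }
    // Assumption: any other column falls back to per-column counts from metadata.
    return Decision.USE_METADATA_COLUMN_COUNT;
  }

  public static void main(String[] args) {
    for (String column : new String[] {"dir0", "filename", "n_name"}) {
      System.out.println(column + " -> " + decide(column));
    }
  }
}
```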
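
To illustrate the two-parameter shape described in item 2, here is a small, self-contained stand-in. It is not the real {{DynamicPojoRecordReader}}: the class {{DynamicRecordsSketch}} and its methods are hypothetical and only mirror the idea of passing the schema as {{LinkedHashMap<String, Class<?>>}} alongside the records as {{List<List<T>>}}. The column names and values in {{main}} are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;

/** Hypothetical stand-in for a schema-plus-records holder; not Drill's reader. */
final class DynamicRecordsSketch<T> {
  private final LinkedHashMap<String, Class<?>> schema;
  private final List<List<T>> records;

  DynamicRecordsSketch(LinkedHashMap<String, Class<?>> schema, List<List<T>> records) {
    // Every row must supply exactly one value per schema column.
    for (List<T> row : records) {
      if (row.size() != schema.size()) {
        throw new IllegalArgumentException("row width does not match schema");
      }
    }
    this.schema = new LinkedHashMap<>(schema);
    this.records = records;
  }

  /** Column names in insertion order; available even when there are no records. */
  List<String> columnNames() {
    return new ArrayList<>(schema.keySet());
  }

  int rowCount() {
    return records.size();
  }

  /** Value at the given row, located by the column's position in the schema. */
  T value(int row, String column) {
    int index = columnNames().indexOf(column);
    if (index < 0) {
      throw new IllegalArgumentException("unknown column: " + column);
    }
    return records.get(row).get(index);
  }

  public static void main(String[] args) {
    LinkedHashMap<String, Class<?>> schema = new LinkedHashMap<>();
    schema.put("count0", Long.class);
    schema.put("count1", Long.class);

    // All values share one type, so T can be Long.
    DynamicRecordsSketch<Long> uniform = new DynamicRecordsSketch<>(
        schema, Collections.singletonList(Arrays.asList(600L, 600L)));
    System.out.println(uniform.columnNames() + " rows=" + uniform.rowCount());

    // No records at all: the schema is still known.
    DynamicRecordsSketch<Long> empty = new DynamicRecordsSketch<>(
        schema, Collections.<List<Long>>emptyList());
    System.out.println(empty.columnNames() + " rows=" + empty.rowCount());

    // Mixed value types: T is set to Object.
    LinkedHashMap<String, Class<?>> mixedSchema = new LinkedHashMap<>();
    mixedSchema.put("file", String.class);
    mixedSchema.put("rows", Long.class);
    DynamicRecordsSketch<Object> mixed = new DynamicRecordsSketch<>(
        mixedSchema,
        Collections.singletonList(Arrays.<Object>asList("nation.parquet", 25L)));
    System.out.println(mixed.value(0, "file") + " " + mixed.value(0, "rows"));
  }
}
```

The empty-records case is the reason the schema is passed explicitly in this sketch: column names and types cannot be inferred from a list with no rows.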


> Count(dir0) on parquet returns 0 result
> ---------------------------------------
>
>                 Key: DRILL-4735
>                 URL: https://issues.apache.org/jira/browse/DRILL-4735
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization, Storage - Parquet
>    Affects Versions: 1.0.0, 1.4.0, 1.6.0, 1.7.0
>            Reporter: Krystal
>            Assignee: Arina Ielchiieva
>            Priority: Critical
>
> Selecting a count of dir0, dir1, etc against a parquet directory returns 0 rows.
> select count(dir0) from `min_max_dir`;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> select count(dir1) from `min_max_dir`;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> If I put both dir0 and dir1 in the same select, it returns expected result:
> select count(dir0), count(dir1) from `min_max_dir`;
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | 600     | 600     |
> +---------+---------+
> Here is the physical plan for count(dir0) query:
> {code}
> 00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {22.0 rows, 22.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1346
> 00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1345
> 00-02        Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1344
> 00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@3da85d3b[columns = null, isStarQuery = false, isSkipQuery = false]]) : rowType = RecordType(BIGINT count): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1343
> {code}
> Here is part of the explain plan for the count(dir0) and count(dir1) in the same select:
> {code}
> 00-00    Screen : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1206.0 rows, 15606.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1623
> 00-01      Project(EXPR$0=[$0], EXPR$1=[$1]) : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1622
> 00-02        StreamAgg(group=[{}], EXPR$0=[COUNT($0)], EXPR$1=[COUNT($1)]) : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1621
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1999/Apr/voter20.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1999/MAR/voter15.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1985/jan/voter5.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1985/apr/voter60.parquet/0_0_0.parquet],..., ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/2014/jul/voter35.parquet/0_0_0.parquet]], selectionRoot=maprfs:/drill/testdata/min_max_dir, numFiles=16, usedMetadataFile=false, columns=[`dir0`, `dir1`]]]) : rowType = RecordType(ANY dir0, ANY dir1): rowcount = 600.0, cumulative cost = {600.0 rows, 1200.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1620
> {code}
> Notice that in the first case, "org.apache.drill.exec.store.pojo.PojoRecordReader" is used.

