drill-issues mailing list archives

From "Arina Ielchiieva (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4735) Count(dir0) on parquet returns 0 result
Date Mon, 26 Jun 2017 09:48:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062851#comment-16062851
] 

Arina Ielchiieva commented on DRILL-4735:
-----------------------------------------

The problem appears to be in the {{ConvertCountToDirectScan}} rule, where we [check the number of
null values in the column|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/ConvertCountToDirectScan.java#L140].
{{oldGrpScan.getColumnValueCount(SchemaPath.getSimplePath(columnName))}} will [return 0
if the column does not exist|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L1052].
It will also return 0 if the column has only null values. File system partition columns
such as dir0 and implicit columns are not present in the {{columnValueCounts}} map, so they
are treated as if they did not exist.
It’s a good idea to convert to Direct Scan when {{oldGrpScan.getColumnValueCount}} returns
0, since the count will be 0 anyway and we won’t have to spend time reading all the table files.
We could return -1 when the column is not found and fall back to reading all the table files.
This would work fine for file system partition and implicit columns, but if the column really
doesn’t exist we would read all the table files in vain.
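
To make the -1 idea concrete, here is a minimal sketch. It is not Drill code: {{ColumnCountSketch}}, {{COLUMN_NOT_FOUND}} and {{canConvertToDirectScan}} are hypothetical names, and a plain map stands in for {{columnValueCounts}}. It only shows how a sentinel separates "column has only nulls" (0) from "column is unknown to the metadata" (-1).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names exist in Drill.
final class ColumnCountSketch {
    // Assumed sentinel meaning "column is not present in the metadata".
    static final long COLUMN_NOT_FOUND = -1L;

    // Returns the stored non-null value count, or -1 when the column is unknown.
    static long getColumnValueCount(Map<String, Long> columnValueCounts, String column) {
        Long count = columnValueCounts.get(column);
        return count == null ? COLUMN_NOT_FOUND : count;
    }

    // Convert to a direct scan only when the metadata actually knows the column.
    static boolean canConvertToDirectScan(Map<String, Long> columnValueCounts, String column) {
        return getColumnValueCount(columnValueCounts, column) != COLUMN_NOT_FOUND;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("name", 600L);     // regular column with values
        counts.put("all_nulls", 0L);  // column that holds only nulls

        for (String column : new String[] {"name", "all_nulls", "dir0"}) {
            System.out.println(column + ": count=" + getColumnValueCount(counts, column)
                + ", directScan=" + canConvertToDirectScan(counts, column));
        }
    }
}
```

With this split, {{dir0}} falls back to a regular scan, while a column that holds only nulls still gets the cheap direct scan. A column that does not exist at all would also fall back, which is the wasted read described above.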
Unfortunately, we can’t find out in {{ConvertCountToDirectScan}} whether a column is a file
system partition or implicit column, since we don’t have access to the session {{OptionManager}},
where the current file system partition and implicit column names are stored (they can be
changed at runtime). In {{ParquetGroupScan}} we do have access to an {{OptionManager}} via
[{{formatPlugin.getContext().getOptionManager()}}|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L203],
but this is the system option manager, and it doesn’t hold session options
(the current file system partition and implicit column names can be changed at session level).
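
The scoping problem can be illustrated with a small sketch. It is not Drill's {{OptionManager}} API: {{OptionScopeSketch}} and its methods are hypothetical, two plain maps stand in for the system and session scopes, and the option name {{drill.exec.storage.file.partition.column.label}} is assumed here to be the partition label setting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of system vs. session option scopes; not Drill's OptionManager.
final class OptionScopeSketch {
    // Assumed name of the option holding the partition column label.
    static final String PARTITION_LABEL = "drill.exec.storage.file.partition.column.label";

    final Map<String, String> systemOptions = new HashMap<>();
    final Map<String, String> sessionOptions = new HashMap<>();

    // What a system-level manager sees: session overrides are invisible.
    String systemValue(String name) {
        return systemOptions.get(name);
    }

    // What a session-level manager sees: the session override wins when present.
    String sessionValue(String name) {
        return sessionOptions.getOrDefault(name, systemOptions.get(name));
    }

    // A partition column is the label followed by digits, e.g. dir0, dir1.
    static boolean isPartitionColumn(String column, String label) {
        return column.matches(Pattern.quote(label) + "\\d+");
    }

    public static void main(String[] args) {
        OptionScopeSketch opts = new OptionScopeSketch();
        opts.systemOptions.put(PARTITION_LABEL, "dir");
        opts.sessionOptions.put(PARTITION_LABEL, "part"); // label changed for this session

        System.out.println(isPartitionColumn("part0", opts.sessionValue(PARTITION_LABEL)));
        System.out.println(isPartitionColumn("part0", opts.systemValue(PARTITION_LABEL)));
    }
}
```

When the session has changed the label, a check that uses only the system scope misclassifies {{part0}} as an ordinary column. That is why the system option manager available in {{ParquetGroupScan}} is not enough.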


> Count(dir0) on parquet returns 0 result
> ---------------------------------------
>
>                 Key: DRILL-4735
>                 URL: https://issues.apache.org/jira/browse/DRILL-4735
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization, Storage - Parquet
>    Affects Versions: 1.0.0, 1.4.0, 1.6.0, 1.7.0
>            Reporter: Krystal
>            Assignee: Jinfeng Ni
>            Priority: Critical
>
> Selecting a count of dir0, dir1, etc. against a parquet directory returns a count of 0.
> select count(dir0) from `min_max_dir`;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> select count(dir1) from `min_max_dir`;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> If I put both dir0 and dir1 in the same select, it returns the expected result:
> select count(dir0), count(dir1) from `min_max_dir`;
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | 600     | 600     |
> +---------+---------+
> Here is the physical plan for count(dir0) query:
> {code}
> 00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {22.0 rows, 22.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1346
> 00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1345
> 00-02        Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1344
> 00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@3da85d3b[columns = null, isStarQuery = false, isSkipQuery = false]]) : rowType = RecordType(BIGINT count): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1343
> {code}
> Here is part of the explain plan for the count(dir0) and count(dir1) in the same select:
> {code}
> 00-00    Screen : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1206.0 rows, 15606.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1623
> 00-01      Project(EXPR$0=[$0], EXPR$1=[$1]) : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1622
> 00-02        StreamAgg(group=[{}], EXPR$0=[COUNT($0)], EXPR$1=[COUNT($1)]) : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1621
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1999/Apr/voter20.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1999/MAR/voter15.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1985/jan/voter5.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1985/apr/voter60.parquet/0_0_0.parquet],..., ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/2014/jul/voter35.parquet/0_0_0.parquet]], selectionRoot=maprfs:/drill/testdata/min_max_dir, numFiles=16, usedMetadataFile=false, columns=[`dir0`, `dir1`]]]) : rowType = RecordType(ANY dir0, ANY dir1): rowcount = 600.0, cumulative cost = {600.0 rows, 1200.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1620
> {code}
> Notice that in the first case, "org.apache.drill.exec.store.pojo.PojoRecordReader" is used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
