drill-issues mailing list archives

From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5542) Scan unnecessary adds implicit columns to ScanRecordBatch for select * query
Date Fri, 26 May 2017 00:53:04 GMT
Jinfeng Ni created DRILL-5542:
---------------------------------

             Summary: Scan unnecessary adds implicit columns to ScanRecordBatch for select * query
                 Key: DRILL-5542
                 URL: https://issues.apache.org/jira/browse/DRILL-5542
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
            Reporter: Jinfeng Ni


Drill appears to add several implicit columns (`fqn`, `filepath`, `filename`, `suffix`)
to ScanBatch even when no downstream operator requires them. Although those implicit
columns are dropped later on, populating them increases both memory and CPU overhead.

1. JSON
```
{a: 100}
```

{code}
select * from dfs.tmp.`1.json`;
+------+
|  a   |
+------+
| 100  |
+------+
{code}

The schema from ScanRecordBatch is:
{code}
[ schema:
    BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)], selectionVector=NONE],
{code}

2. Parquet
{code}
select * from cp.`tpch/nation.parquet`;
+--------------+-----------------+--------------+------------------------------------------------------+
| n_nationkey  |     n_name      | n_regionkey  |                      n_comment                       |
+--------------+-----------------+--------------+------------------------------------------------------+
| 0            | ALGERIA         | 0            |  haggle. carefully final deposits detect slyly agai  |
...
{code}

The schema of ScanRecordBatch:
{code}
  schema:
    BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED), n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
{code}

3. Text
{code}
cat 1.csv
a, b, c

select * from dfs.tmp.`1.csv`;
+----------------+
|    columns     |
+----------------+
| ["a","b","c"]  |
+----------------+
{code}

The schema of ScanRecordBatch:
{code}
  schema:
    BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)], fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
{code}

If implicit columns are not part of the result of a `select *` query, then the Scan operator
should not populate those implicit columns.
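
The expected behavior can be sketched roughly as follows. This is only an illustrative, language-neutral sketch in Python, not Drill code: the function name `implicit_columns_to_populate` and the idea of passing the projected column names as a plain list are assumptions made for the example. It shows a scan that materializes an implicit column only when the query names it explicitly, so a bare `*` populates none of them.

```python
# Hypothetical sketch; names are illustrative and not taken from Drill's code base.
IMPLICIT_COLUMNS = ("fqn", "filepath", "filename", "suffix")


def implicit_columns_to_populate(projected_columns):
    """Return the implicit columns a scan would need to materialize.

    `projected_columns` is the list of column names in the SELECT list;
    a lone "*" stands for a star query.
    """
    requested = {name.lower() for name in projected_columns}
    # Only implicit columns named explicitly are needed; "*" alone selects none.
    return [col for col in IMPLICIT_COLUMNS if col in requested]


print(implicit_columns_to_populate(["*"]))              # []
print(implicit_columns_to_populate(["a", "filename"]))  # ['filename']
print(implicit_columns_to_populate(["*", "suffix"]))    # ['suffix']
```

Under this assumption, the three `select *` examples above would produce ScanRecordBatch schemas containing only the data columns (`a`; the four `n_*` columns; `columns`).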



