drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Barclay <dbarc...@maprtech.com>
Subject How is source's table schema supposed to be passed when zero rows?
Date Sat, 25 Jul 2015 22:35:55 GMT
How is the table schema from a source (e.g, an INFORMATION_SCHEMA
table, a JSON file, etc.) supposed to get passed through the operator
tree and/or batches, especially at the beginning around scans?

In particular, what's supposed to make sure that the schema gets
created/passed even if there are zero rows in the data?


I've run into several cases where the net schema (i.e., the set of
columns shown by SQLLine or reported by JDBC's ResultSetMetaData) ends
up being empty when there are zero rows in the data.

It seems that instead of being set explicitly (from whatever's
available from the data source):
- the schema is derived only implicitly from value vector wrappers
  (with MaterializedField data),
- but the value vector wrappers are created only when first trying to
   write data (created when processing the first row's values)--
- so when there are zero rows, there are no value vector wrappers, and
   the schema ends up empty.


One case occurs with the INFORMATION_SCHEMA plug-in.  In the following,
notice that when there are zero rows because of pushed-down filtering
(notice the TABLE_SCHEMA and TABLE_NAME cases), the schema is empty,
but if there are zero rows because of other filtering or limiting,
the schema is correct and has all the statically known columns:

 > SELECT * FROM INFORMATION_SCHEMA.`TABLES` WHERE TABLE_CATALOG = 'NoSuch';
+----------------+---------------+-------------+-------------+
| TABLE_CATALOG  | TABLE_SCHEMA  | TABLE_NAME  | TABLE_TYPE  |
+----------------+---------------+-------------+-------------+
+----------------+---------------+-------------+-------------+
No rows selected (0.194 seconds)

 > SELECT * FROM INFORMATION_SCHEMA.`TABLES` WHERE TABLE_SCHEMA = 'NoSuch';
+--+
|  |
+--+
+--+
No rows selected (0.279 seconds)

 > SELECT * FROM INFORMATION_SCHEMA.`TABLES` WHERE TABLE_NAME = 'NoSuch';
+--+
|  |
+--+
+--+
No rows selected (0.283 seconds)

 > SELECT * FROM INFORMATION_SCHEMA.`TABLES` WHERE TABLE_TYPE = 'NoSuch';
+----------------+---------------+-------------+-------------+
| TABLE_CATALOG  | TABLE_SCHEMA  | TABLE_NAME  | TABLE_TYPE  |
+----------------+---------------+-------------+-------------+
+----------------+---------------+-------------+-------------+
No rows selected (0.26 seconds)

 > SELECT * FROM INFORMATION_SCHEMA.`TABLES` WHERE FALSE;
+----------------+---------------+-------------+-------------+
| TABLE_CATALOG  | TABLE_SCHEMA  | TABLE_NAME  | TABLE_TYPE  |
+----------------+---------------+-------------+-------------+
+----------------+---------------+-------------+-------------+
No rows selected (0.219 seconds)

 > SELECT * FROM INFORMATION_SCHEMA.`TABLES` LIMIT 0;
+----------------+---------------+-------------+-------------+
| TABLE_CATALOG  | TABLE_SCHEMA  | TABLE_NAME  | TABLE_TYPE  |
+----------------+---------------+-------------+-------------+
+----------------+---------------+-------------+-------------+
No rows selected (0.269 seconds)


(The underlying problem is not specific to the INFORMATION_SCHEMA
plug-in:  If I modify a system table to have zero rows, the schema
ends of having no columns.)



Another case is an empty (zero-byte) JSON file.  In the following,
notice how the "a, b, c" part of the <select list> shows up for query
Q2, as expected, but does not for query Q4:

Q1:
 > SELECT * FROM `dfs.root`.`/tmp/one_object_with_one_property.json`;
+----+
| x  |
+----+
| y  |
+----+
1 row selected (0.169 seconds)

Q2:
 > SELECT a, b, c, * FROM `dfs.root`.`/tmp/one_object_with_one_property.json`;
+-------+-------+-------+----+
|   a   |   b   |   c   | x  |
+-------+-------+-------+----+
| null  | null  | null  | y  |
+-------+-------+-------+----+
1 row selected (0.16 seconds)

Q3:
 > SELECT * FROM `dfs.root`.`/tmp/empty.json`;
+--+
|  |
+--+
+--+
No rows selected (0.161 seconds)

Q4:
 > SELECT a, b, c, * FROM `dfs.root`.`/tmp/empty.json`;
+--+
|  |
+--+
+--+
No rows selected (0.145 seconds)


So, how and where is schema information supposed to be provided so
that it doesn't depend on there being any rows to create value vector
wrappers?


Thanks,
Daniel
-- 
Daniel Barclay
MapR Technologies

Mime
View raw message