drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-1499) Different column order could appear in the result set for a schema-less select * query, even there are no changing schemas.
Date Mon, 06 Oct 2014 16:58:38 GMT
Jinfeng Ni created DRILL-1499:
---------------------------------

             Summary: Different column order could appear in the result set for a schema-less
select * query, even there are no changing schemas.
                 Key: DRILL-1499
                 URL: https://issues.apache.org/jira/browse/DRILL-1499
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Jinfeng Ni
            Assignee: Jinfeng Ni


For a select * query referring to a schema-less table, Drill could return different column,
depending on the physical operators the query involves:

Q1:

{code}
select * from cp.`employee.json` limit 3;
+-------------+------------+------------+------------+-------------+----------------+------------+---------------+------------+------------+------------+---------------+-----------------+----------------+------------+-----------------+
| employee_id | full_name  | first_name | last_name  | position_id | position_title |  store_id
 | department_id | birth_date | hire_date  |   salary   | supervisor_id | education_level
| marital_status |   gender   | management_role |
+-------------+------------+------------+------------+-------------+----------------+------------+---------------+------------+------------+------------+---------------+-----------------+----------------+------------+-----------------+
{code}

Q2:

{code}
select * from cp.`employee.json` order by last_name limit 3;
+------------+---------------+-----------------+-------------+------------+------------+------------+------------+------------+-----------------+----------------+-------------+----------------+------------+------------+---------------+
| birth_date | department_id | education_level | employee_id | first_name | full_name  | 
 gender   | hire_date  | last_name  | management_role | marital_status | position_id | position_title
|   salary   |  store_id  | supervisor_id |
+------------+---------------+-----------------+-------------+------------+------------+------------+------------+------------+-----------------+----------------+-------------+----------------+------------+------------+---------------+
{code}

The difference between Q1 and Q2 is the order by clause.  With order by clause in Q2, Drill
will sort the column names alphabetically, while for Q1, the column names are in the same
order as in the data source. 

The underlying cause for such difference is that the sort or sort-based merger operator would
require canonicalization, since the incoming batches could contain different schemas. 

 However, it would be better that such canonicalization is used only when the incoming batches
have changing schemas. If all the incoming batches have identical schemas, no need to sort
the column orders.  With this fix, Drill will present the same column order in the result
set, for a schema-less select * query,  if there is no changing schemas from incoming data
sources. 







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message