drill-dev mailing list archives

From "John Omernik (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-4379) Unexpected Table Behavior with only one subdirectory vs. Many
Date Tue, 09 Feb 2016 14:06:18 GMT
John Omernik created DRILL-4379:
-----------------------------------

             Summary: Unexpected Table Behavior with only one subdirectory vs. Many 
                 Key: DRILL-4379
                 URL: https://issues.apache.org/jira/browse/DRILL-4379
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 1.4.0
            Reporter: John Omernik


A common practice is to use subdirectories below a main directory as a partitioning device.
Say you have a table named "myawesomedata" that receives new data every day. It would be
valuable to create the main directory, then one subdirectory per day, to help optimize
queries that run against only certain days of data.

/myawesomedata/
/myawesomedata/2016-02-01
/myawesomedata/2016-02-02
/myawesomedata/2016-02-03
/myawesomedata/2016-02-04

I have identified a condition: if there is ONLY one subdirectory, a query that filters on
dir0 does not return the results a user would expect.

Example:

With the layout above, if I run:

select count(1) from `myawesomedata`;

I get an accurate count across all subdirectories.

If I run:

select count(1) from `myawesomedata` where dir0 = '2016-02-01';

I get an accurate count for only the subdirectory 2016-02-01.

However, if I delete subdirectories 2016-02-02, 2016-02-03, and 2016-02-04 and am left with:

/myawesomedata/
/myawesomedata/2016-02-01

Then if I run:

select count(1) from `myawesomedata`;

It returns the accurate count (which is just that of the 2016-02-01 directory). 

However, if I run:

select count(1) from `myawesomedata` where dir0 = '2016-02-01';

It takes much longer (15 seconds vs. instant for the other queries) and returns no results,
even though this is the same query that worked above with 2 or more subdirectories.
Basically, when there is only one subdirectory, a query asking for only that directory does
not behave the same way as when there are more subdirectories. This is an unexpected user
experience, and I believe it could cause user frustration and unexpected results when
using Drill on data.
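
For anyone trying to reproduce this, the two directory layouts described above could be
built on a local filesystem with something like the following Python sketch. The file
name data.csv, the CSV format, the row contents, and the helper names are all assumptions
made for illustration; none of them come from the original report. A Drill workspace
would still need to be pointed at the parent directory before running the queries.

```python
import csv
import shutil
import tempfile
from pathlib import Path

# Day subdirectories taken from the layout in the report.
DAYS = ["2016-02-01", "2016-02-02", "2016-02-03", "2016-02-04"]


def build_layout(root: Path, days=DAYS, rows_per_day=3) -> Path:
    """Create <root>/myawesomedata/<day>/data.csv for each day (hypothetical data)."""
    table = root / "myawesomedata"
    for day in days:
        day_dir = table / day
        day_dir.mkdir(parents=True, exist_ok=True)
        with open(day_dir / "data.csv", "w", newline="") as fh:
            writer = csv.writer(fh)
            for i in range(rows_per_day):
                writer.writerow([day, i])
    return table


def keep_only(table: Path, day: str) -> None:
    """Delete every subdirectory except `day`, leaving the single-subdirectory case."""
    for child in table.iterdir():
        if child.is_dir() and child.name != day:
            shutil.rmtree(child)


def subdirs(table: Path) -> list:
    """Return the sorted names of the subdirectories under the table directory."""
    return sorted(p.name for p in table.iterdir() if p.is_dir())


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    table = build_layout(root)
    print("many subdirectories:", subdirs(table))
    keep_only(table, "2016-02-01")
    print("one subdirectory:", subdirs(table))
```

Running the four queries once after build_layout and again after keep_only should show
whether the dir0 filter behaves differently in the single-subdirectory case.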

 






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
