drill-issues mailing list archives

From "Jason Altekruse (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-4308) Aggregate operations on dir<N> columns can be more efficient for certain use cases
Date Fri, 29 Jan 2016 22:27:40 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124340#comment-15124340 ]

Jason Altekruse edited comment on DRILL-4308 at 1/29/16 10:27 PM:
------------------------------------------------------------------

Hey [~amansinha100], I tried to re-create this and was not able to see this behavior. I only
created the folder structure on my local machine (shown below), but I seem to be getting
correct results for these types of queries.

{code}
0: jdbc:drill:zk=local> select dir0 from mock_data where dir0 = mindir('dfs.mxd','mock_data') limit 1;
+-------+
| dir0  |
+-------+
| 1994  |
+-------+
1 row selected (0.127 seconds)
0: jdbc:drill:zk=local> select dir0 from mock_data where dir0 = maxdir('dfs.mxd','mock_data') limit 1;
+-------+
| dir0  |
+-------+
| 1997  |
+-------+
1 row selected (0.123 seconds)



Jasons-MacBook-Pro:maxdir jaltekruse$ tree mock_data/
mock_data/
├── 1994
│   ├── Q1
│   │   └── data.csv
│   ├── Q2
│   │   └── data.csv
│   ├── Q3
│   │   └── data.csv
│   └── Q4
│       └── data.csv
├── 1995
│   ├── Q1
│   │   └── data.csv
│   ├── Q2
│   │   └── data.csv
│   ├── Q3
│   │   └── data.csv
│   └── Q4
│       └── data.csv
├── 1996
│   ├── Q1
│   │   └── data.csv
│   ├── Q2
│   │   └── data.csv
│   ├── Q3
│   │   └── data.csv
│   └── Q4
│       └── data.csv
└── 1997
    ├── Q1
    │   └── data.csv
    ├── Q2
    │   └── data.csv
    ├── Q3
    │   └── data.csv
    └── Q4
        └── data.csv
{code}
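
For anyone who wants to repeat this locally, here is a minimal sketch that builds the same
folder layout as the tree output above. The helper name build_mock_data and the one-line
placeholder written to each data.csv are assumptions for illustration; the actual file
contents are not shown in this comment.

```python
import tempfile
from pathlib import Path


def build_mock_data(root: Path) -> Path:
    """Create mock_data/<year>/<quarter>/data.csv under root and return mock_data."""
    base = root / "mock_data"
    for year in range(1994, 1998):
        for quarter in ("Q1", "Q2", "Q3", "Q4"):
            leaf = base / str(year) / quarter
            leaf.mkdir(parents=True, exist_ok=True)
            # Placeholder row: the real file contents are not given in the comment.
            (leaf / "data.csv").write_text("placeholder\n")
    return base


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = build_mock_data(Path(tmp))
        # List every generated file relative to the parent of mock_data.
        for path in sorted(base.rglob("data.csv")):
            print(path.relative_to(base.parent))
```

To query the result from Drill, the directory would need to live under a workspace that
the storage plugin can see, rather than in a temporary directory as in this demo.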


> Aggregate operations on dir<N> columns can be more efficient for certain use cases
> ----------------------------------------------------------------------------------
>
>                 Key: DRILL-4308
>                 URL: https://issues.apache.org/jira/browse/DRILL-4308
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.4.0
>            Reporter: Aman Sinha
>
> For queries that perform plain aggregates or DISTINCT operations on the directory partition
> columns (dir0, dir1, etc.) and reference no other columns, performance could be
> substantially improved by not having to scan the entire dataset.
> Consider the following types of queries:
> {noformat}
> select  min(dir0) from largetable;
> select  distinct dir0 from largetable;
> {noformat}
> The number of distinct values of dir<N> columns is typically quite small, so there is
> no reason to scan the large table. This has also come up as feedback from some Drill users.
> Of course, if any other column is referenced in the query (WHERE, ORDER BY, etc.), then
> we cannot apply this optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
