drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-1716) Nested Data : There should be an easy way to apply aggregate functions on repeated types
Date Mon, 17 Nov 2014 01:35:38 GMT

     [ https://issues.apache.org/jira/browse/DRILL-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jacques Nadeau updated DRILL-1716:
----------------------------------
    Fix Version/s: 0.9.0

> Nested Data : There should be an easy way to apply aggregate functions on repeated types
> ----------------------------------------------------------------------------------------
>
>                 Key: DRILL-1716
>                 URL: https://issues.apache.org/jira/browse/DRILL-1716
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Functions - Drill, Storage - JSON
>            Reporter: Rahul Challapalli
>             Fix For: 0.9.0
>
>
> {code}
> {
>     "company_id": 1,
>     "evnts": [
>         {
>             "evnt_id": 999,
>             "evnt_duration": 60
>         },
>         {
>             "evnt_id": 998,
>             "evnt_duration": 30
>         },
>         {
>             "evnt_id": 997,
>             "evnt_duration": 45
>         }
>     ]
> }
> {code}
> For the above dataset, if I want to find the longest duration for each company id, below
is how I would do it now
> {code}
> select sub.company_id , max(sub.evnt.evnt_duration) max_duration 
> from ( 
>     select company_id, flatten(evnts) evnt from `nested.json`
> ) sub 
> group by sub.company_id;
> +------------+--------------+
> | company_id | max_duration |
> +------------+--------------+
> | 1          | 60           |
> +------------+--------------+
> {code}
> Now if I want the evnt_id associated with the longest duration then we need one more
join
> {code}
> select a.company_id, b.evnt.evnt_id 
> from (
>     select sub.company_id company_id, max(sub.evnt.evnt_duration) max_duration 
>     from ( 
>         select company_id, flatten(evnts) evnt from `nested.json`
>     ) sub 
>     group by sub.company_id
> ) a 
> join 
> (
>     select flatten(evnts) evnt from `nested.json`
> ) b 
> on a.max_duration = b.evnt.evnt_duration;
> {code}
> The above query currently fails (DRILL-1649). But that is how we have to do it currently.
> It would be much simpler if I can do something like the below
> {code}
> select company_id, nested_agg('evnts', 'max', 'evnt_duration','evnt_id') as evnt_id from
`nested.json`;
> {code}
> Apart from making the query much simpler to write this might enhance drill's performance
as well.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message