hive-user mailing list archives

From Dilip Joseph <dilip.antony.jos...@gmail.com>
Subject Re: Top N by Group Query
Date Wed, 13 Jul 2011 15:37:04 GMT
You can also try using a custom reducer script, as follows:

FROM (
    SELECT groupCol, metric, otherFieldYouCareAbout
    FROM MyTable
    DISTRIBUTE BY groupCol
    SORT BY groupCol ASC, metric DESC
) t1
REDUCE *
    USING 'myGroupingReduceScript.py'
    AS groupCol, metric, otherFieldYouCareAbout;


All rows of a particular group go to the same instance of
myGroupingReduceScript, in sorted order.  The script tracks the current
group and passes each row through to its output.  Once it has output 50
rows for the current group, it ignores the remaining rows for that group.
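A minimal sketch of what such a reducer script could look like (this is a
hypothetical implementation, not the author's actual script; it assumes Hive
streams tab-separated rows on stdin with groupCol as the first field, and
uses the top-50 limit discussed in this thread):

```python
#!/usr/bin/env python
import sys

TOP_N = 50  # keep at most this many rows per group (from this thread)

def top_n_per_group(lines, n=TOP_N):
    """Yield at most n rows per group.

    Input must already be distributed and sorted by the group column,
    so each group's rows arrive contiguously (as DISTRIBUTE BY / SORT BY
    guarantees).
    """
    current_group = None
    emitted = 0
    for line in lines:
        row = line.rstrip('\n')
        group_col = row.split('\t', 1)[0]
        if group_col != current_group:
            # A new group starts: remember it and reset the counter.
            current_group = group_col
            emitted = 0
        if emitted < n:
            yield row
            emitted += 1
        # Rows beyond n for the current group are silently dropped.

if __name__ == '__main__':
    for out_row in top_n_per_group(sys.stdin):
        print(out_row)
```

Because SORT BY orders each group's rows by metric descending, the first
50 rows the script sees for a group are that group's top 50.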

The disadvantage of this approach is that all rows for a particular group go
to the same reducer. If you have millions of rows per group, that may be
slow.

Dilip

On Tue, Jul 12, 2011 at 11:58 PM, Igor Tatarinov <igor@decide.com> wrote:

> I would use a UDF. Your query would look something like this:
>
> SELECT group_key, value, rank(group_key) AS rank
> FROM (SELECT group_key, value
>       FROM my_table
>       DISTRIBUTE BY group_key
>       SORT BY group_key, value DESC) t
> HAVING rank <= 50;
>
> The trick is to make the UDF keep track of the group key so that it can
> reset the rank whenever a new group starts. Also, I am pretty sure you have
> to put DISTRIBUTE/SORT BY into a subquery. If you put those in the main
> query, the data won't be partitioned and sorted before rank() runs, so the
> ranking won't work.
>
> Not sure if the HAVING clause would work in Hive. You can use WHERE
> rank(group_key)<=50 instead.
>
> Hope this helps.
> igor
> decide.com
>
>
> On Mon, Jul 11, 2011 at 6:27 PM, Vipul Patel <vipulmsfpatel@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to run a query that will return the top 50 results within
>> each group.  The table contains millions of records and there are over
>> 100 unique values in the field I wish to group by -- consequently
>> using UNION to string together subqueries won't be very efficient.
>> Any suggestions would be very much appreciated.
>>
>> Thank you in advance!
>>
>> Vipul Patel
>>
>
>


-- 
_________________________________________
Dilip Antony Joseph
http://csgrad.blogspot.com
http://www.marydilip.info
