hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: count(*) not allowed in order by
Date Mon, 07 Mar 2016 17:49:13 GMT
Hi,

You arte looking at the top 25 of result set so you will have to get full
result set before looking at top 25

Something like this

 select rs.prod_id, rs.score from
 (
   prod_id, count(prod_id) AS Score from sales GROUP BY prod_id ORDER BY
Score DESC
 )rs
 LIMIT 25;

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 7 March 2016 at 17:04, Awhan Patnaik <awhan@spotzot.com> wrote:

> I have to take the first 25 IDs ranked by count(*). But the following is
> not allowed in Hive
>
> select id from T order by count(*) desc limit 25;
>
> Which yields a "NOt yet supported place for UDAF count". The way around it
> is the following
>
> select id, count(*) as cnt from T group by id order by cnt desc limit 25;
>
> However I need to put this query in a subquery like so
>
> select id, XXX from T t join .... where t.id in (select id, count(*) as
> cnt from T group by id order by cnt desc limit 25) group by ... ;
>
> which does NOT work because a subquery is allowed to return only one
> thing. Here XXX are complex constructs like distance calculation and
> binning. These are time consuming and complex operations.
>
> The only way I see this would work is if I use the following sub-sub query
>
> select id, XXX from T t join ... where t.id in (select sub.id from
> (select id, count(*) as cnt from T group by id order by cnt desc limit
> 25)sub) group by ... ;
>
> The reason I don't want to put the limit in the outermost query is because
> those XXX queries are expensive and I don't want them to be performed on
> the entire result only to retain the top 25 and throw away the rest of the
> results. The count(*) operation of the rest of the IDs is not very
> expensive or time consuming.
>
> Is there some other elegant way of handling this without using a
> sub-sub-query approach?
>

Mime
View raw message