Is there some other elegant way of handling this without using a sub-sub-query approach?
I have to take the first 25 IDs ranked by count(*). But the following is not allowed in Hive
select id from T order by count(*) desc limit 25;
Which yields a "NOt yet supported place for UDAF count". The way around it is the following
select id, count(*) as cnt from T group by id order by cnt desc limit 25;
However I need to put this query in a subquery like so
select id, XXX from T t join .... where t.id
in (select id, count(*) as cnt from T group by id order by cnt desc limit 25) group by ... ;
which does NOT work because a subquery is allowed to return only one thing. Here XXX are complex constructs like distance calculation and binning. These are time consuming and complex operations.
The only way I see this would work is if I use the following sub-sub query
select id, XXX from T t join ... where t.id
in (select sub.id
from (select id, count(*) as cnt from T group by id order by cnt desc limit 25)sub) group by ... ;
The reason I don't want to put the limit in the outermost query is because those XXX queries are expensive and I don't want them to be performed on the entire result only to retain the top 25 and throw away the rest of the results. The count(*) operation of the rest of the IDs is not very expensive or time consuming.