hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michal Krawczyk <michal.krawc...@u2i.com>
Subject Possible bug with max() together with rank() and grouping sets
Date Tue, 21 Oct 2014 12:21:01 GMT
Hi all,

Recently I've run into a problem with incorrect results in one of the
queries on our system after upgrade from Hive 0.8.1.4 to 0.13.1. We use
Amazon Elastic Map Reduce servivce on Amazon. I tried to simplify the
original query and replicate this issue on a small dataset. Please take a
look at the queries below and let me know what are your thoughts.

I have the following table:
CREATE  TABLE `t`(
  `category` int,
  `live` int,
  `comments` int)

with the following data:
hive> select * from t;
OK
3       0       2
2       0       2
8       0       2

The query:
hive> select category, max(live) live, max(comments) comments, rank() OVER
(PARTITION BY category ORDER BY comments) rank1
FROM t
GROUP BY category
GROUPING SETS ((), (category))
HAVING max(comments) > 0;

return the following results:

NULL    1       48      1
2       1       49      1
3       1       49      1
8       1       49      1

Long story short when using grouping sets with the rank() function the
max() function return incorrect results. Everything works fine if I remove
grouping sets clause and split the query into two independent queries or
remove the rank() function.

This looks like a bug to me but please review. That said, I'm not sure if
it's just Amazon issue or general Hive issue.

Thanks,
Michal

-- 
Michal Krawczyk
Project Manager / Tech Lead
Union Square Internet Development
http://www.u2i.com/

Mime
View raw message