hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Donald Matthews <>
Subject a GROUP BY that is not fully grouping
Date Tue, 01 Nov 2016 20:06:06 GMT
After upgrading someone's Spark/Hive program from Spark 1.5.2 to Spark
1.6.2, I've run into a GROUP BY that does not work reliably in the newer
version: the GROUP BY results are not always fully aggregated. Instead, I
get lots of duplicate + triplicate sets of group values. Seems like a Hive
bug to me, but before I create a Jira ticket I thought I should check with
the list.

Input:  A single table with 24 fields that I want to group on, and a few
other fields that I want to aggregate.

Statement: similar to
SELECT a,b,c, ..., x, count(y) as yc, sum(z1) as z1s, sum(z2) as z2s
FROM inputTable
GROUP BY a,b,c, ..., x

Checking the data for one sample run, I see that the input table has about
1.1M rows, with 18157 unique combinations of those 24 grouped values.

Expected output: A table of 18157 rows.

Observed output: A table of 28006 rows. Looking just at unique combinations
of those grouped fields, I see that while 10125 rows are unique as
expected, there are 6215 duplicate rows and 1817 triplicate rows.

This is not 100% repeatable. That is, I'll see the issue repeatedly one
day, but the next day with the same input data the GROUP BY will work
correctly. Anyway, for what it's worth I have captured parquets of the
input + output at a time when this issue kicked in.

For now it seems that I have a workaround: if I presort the input table by
the grouped fields, the GROUP BY works correctly. But of course I shouldn't
have to do that.

I was unable to find any recent reports of this sort of thing in the Hive
Jira or on this list. Which is odd. Does this misbehaviour seem familiar to


View raw message