hive-user mailing list archives

From "" <>
Subject With different parameters or column number dense_rank function gets different count distinct results
Date Wed, 29 Oct 2014 09:31:48 GMT
Hi Hive users,
We create a table with SQL that contains the dense_rank function, and then run count distinct
on this table.
We found that with different dense_rank parameters, or even a different number of columns, we get
different count distinct results:
1. With less data the results are correct (in our test case, 200 million rows give the same results, but 300
million rows give different results).
2. Different dense_rank parameters may give different results, e.g. "dense_rank() over(distribute
by a,b sort by c desc)" and "dense_rank() over(distribute by a sort by c desc)".
3. All window functions (rank, row_number, dense_rank) have this problem.
4. With fewer columns the results may be correct.
5. count(1) is correct, but count distinct gets different results.
6. It seems that some rows have been lost and some rows repeated.
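
The pattern described in the points above might look roughly like the following HiveQL sketch. This is not the original test SQL: the table names (src, t_ab, t_a), the columns a, b, c, d, and the choice of d as the counted column are all hypothetical placeholders. Only the two dense_rank() window clauses are taken from the message.

```sql
-- Hypothetical sketch: src, t_ab, t_a and columns a, b, c, d are placeholders.

-- Variant 1: window distributed by (a, b), as quoted in point 2.
CREATE TABLE t_ab AS
SELECT a, b, c, d,
       dense_rank() OVER (DISTRIBUTE BY a, b SORT BY c DESC) AS rk
FROM src;

-- Variant 2: window distributed by a only, as quoted in point 2.
CREATE TABLE t_a AS
SELECT a, b, c, d,
       dense_rank() OVER (DISTRIBUTE BY a SORT BY c DESC) AS rk
FROM src;

-- A window function adds a column but should neither drop nor duplicate rows,
-- so all three queries would be expected to return the same pair of counts.
SELECT count(1), count(DISTINCT d) FROM src;
SELECT count(1), count(DISTINCT d) FROM t_ab;
SELECT count(1), count(DISTINCT d) FROM t_a;
```

If count(1) matches across the three tables while count(DISTINCT d) does not, that would be consistent with point 6: some rows lost and others repeated in equal number.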

Test data (file is too large to upload):

test sql:

