hive-dev mailing list archives

From "ericni (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-8590) With different parameters or column number dense_rank function gets different count distinct results
Date Fri, 24 Oct 2014 06:31:33 GMT
ericni created HIVE-8590:
----------------------------

             Summary: With different parameters or column number dense_rank function gets different count distinct results
                 Key: HIVE-8590
                 URL: https://issues.apache.org/jira/browse/HIVE-8590
             Project: Hive
          Issue Type: Bug
          Components: UDF
    Affects Versions: 0.13.1
         Environment: cdh 4.6.0/hive0.13
            Reporter: ericni


We create a table using SQL that contains the dense_rank function, and then run count distinct
on this table.
We found that with different dense_rank parameters, or even a different number of columns, we get
different count distinct results:
1. With less data the results are OK (in our test case, 200 million rows give the same results, but 300
million rows give different results).
2. Different dense_rank parameters may give different results, e.g. "dense_rank() over(distribute
by a,b sort by c desc)" and "dense_rank() over(distribute by a sort by c desc)".
3. All window functions (rank, row_number, dense_rank) have this problem.
4. With fewer columns the results may be OK.
5. Count(1) is OK, but count distinct gets different results.
6. It seems that some rows have been lost and some rows have been repeated.
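
The scenario described above can be sketched roughly as follows. This is only an illustration, not the reporter's actual test SQL (that is in the linked file): the table names src, ranked_ab and ranked_a and the column d are hypothetical, and only the two dense_rank() window clauses are taken from the report.

```sql
-- Hypothetical names: src, ranked_ab, ranked_a and column d are assumptions.
-- Only the two OVER(...) clauses come from the report.

-- Variant 1: window distributed by (a, b)
CREATE TABLE ranked_ab AS
SELECT a, b, c, d,
       dense_rank() OVER (DISTRIBUTE BY a, b SORT BY c DESC) AS rk
FROM src;

-- Variant 2: window distributed by (a) only
CREATE TABLE ranked_a AS
SELECT a, b, c, d,
       dense_rank() OVER (DISTRIBUTE BY a SORT BY c DESC) AS rk
FROM src;

-- Both tables are built from the same rows of src, so both queries
-- should return identical values. Per the report, COUNT(1) matches
-- but COUNT(DISTINCT ...) differs once the input is large enough.
SELECT COUNT(1) AS total_rows, COUNT(DISTINCT d) AS distinct_d FROM ranked_ab;
SELECT COUNT(1) AS total_rows, COUNT(DISTINCT d) AS distinct_d FROM ranked_a;
```

Since the window function only adds the rk column, a difference in COUNT(DISTINCT d) between the two tables would be consistent with observation 6, that some rows are lost and others duplicated.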

Test data (the file is too large to upload):
http://pan.baidu.com/s/1hqnCzze

Test SQL:
http://pan.baidu.com/s/1eQna8q2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
