hive-user mailing list archives

From <simon.2.thomp...@bt.com>
Subject Hash partition question
Date Wed, 23 Oct 2013 11:03:57 GMT
Hi there,

I have created a table of numbers using CLUSTERED BY and am sampling it by bucket.

If I am selecting 10000 candidates from ~125m, how can I get a good random selection?

Should I create 12500 clusters? Or should I create 100 clusters and then use the sample function
(... from 12500)?
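
For reference, here is the rough arithmetic behind the two options as a small Python sketch. It assumes the rows hash evenly across buckets, and it assumes (from my reading of the Hive TABLESAMPLE documentation) that when the sample denominator is a multiple of the physical bucket count, only a fraction of one bucket is returned. The function name and constants are my own, for illustration only.

```python
# Rough arithmetic only; assumes rows are spread evenly across buckets.
TOTAL_ROWS = 125_000_000   # approximate table size (~125m)
SAMPLE_DENOMINATOR = 12_500  # "1 out of 12500" in both options


def expected_sample(total_rows: int, physical_buckets: int, denominator: int) -> float:
    """Expected rows when sampling 1 bucket out of `denominator`.

    Assumes `denominator` is a multiple of `physical_buckets`, so the
    sample is a fraction (physical_buckets / denominator) of one bucket.
    """
    rows_per_bucket = total_rows / physical_buckets
    fraction_of_bucket = physical_buckets / denominator
    return rows_per_bucket * fraction_of_bucket


# Option 1: 12500 physical buckets, read one whole bucket.
option_1 = expected_sample(TOTAL_ROWS, 12_500, SAMPLE_DENOMINATOR)

# Option 2: 100 physical buckets, sample 1 out of 12500 (1/125 of one bucket).
option_2 = expected_sample(TOTAL_ROWS, 100, SAMPLE_DENOMINATOR)

print(option_1, option_2)
```

Under these assumptions both options give the same expected sample size, so the choice would come down to how random the selection is and how much data has to be scanned, not the row count.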

Simon
