This looks like a sampling problem. The short answer is to use Spark's RDD.sample.
If RDD.sample is still too slow for your requirements, then reservoir sampling
(https://en.wikipedia.org/wiki/Reservoir_sampling) may be the direction to investigate,
but I am not sure whether an existing implementation is available yet.
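
To make the reservoir sampling idea concrete, here is a minimal sketch in plain Python of
the basic variant (Algorithm R). It is not tied to HBase or Spark: the function name
reservoir_sample and the generated row keys are only illustrative stand-ins for whatever a
table scan would return. Note that this still reads every item once; what it avoids is
knowing the row count in advance or holding more than k items in memory.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(row)
        else:
            # Keep the new item with probability k / (i + 1) by picking a
            # slot in [0, i] and replacing only if it falls inside the reservoir.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Hypothetical stand-in for a table scan: a generator of row keys.
row_keys = (f"row-{n:07d}" for n in range(100_000))
sample = reservoir_sample(row_keys, k=100, seed=42)
print(len(sample))  # prints 100
```

If the input has fewer than k items, the function simply returns all of them.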
________________________________
From: Liu, Ming (Ming) <ming.liu@esgyn.cn>
Sent: Friday, April 13, 2018 12:16:07 AM
To: user@hbase.apache.org
Subject: how to get random rows from a big hbase table faster
Hi, all,
We have an HBase table with 1 billion rows, and we want to randomly get 1M rows from that table.
We are now trying RandomRowFilter, but it is still very slow. If I understand it correctly,
on the server side RandomRowFilter still needs to read all 1 billion rows and then randomly
returns 1% of them. But reading 1 billion rows is very slow. Is this true?
So is there a better way to randomly get 1% of the rows from a given table? Any ideas would
be much appreciated.
We don't know the distribution of the 1 billion rows in advance.
Thanks,
Ming
