hbase-user mailing list archives

From David Koch <ogd...@googlemail.com>
Subject Efficient way to sample from large HBase table.
Date Fri, 12 Oct 2012 15:04:31 GMT
Hello,

I need to sample 1 million rows from a large HBase table. What is an
efficient way of doing this?

I thought about using a RandomRowFilter on a scan of the source table, in
combination with a Mapper, to get approximately the right number of rows.
However, since MapReduce counters cannot be reliably retrieved while the job
is running, I would need an external counter to keep track of the number of
sampled records and stop the job at 1 million.
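
To make the "approximately the right number of rows" part concrete, here is a
rough sketch of the arithmetic only. It assumes an estimate of the table's
total row count is available (the 100 million figure below is made up), and
the class and method names are hypothetical. The computed fraction is the
float "chance" value that RandomRowFilter takes in its constructor; the
simulation only mimics an independent per-row coin flip using java.util.Random
and does not touch HBase.

```java
import java.util.Random;

public class SampleChance {

    // Fraction of rows to keep so that roughly `target` rows pass the filter.
    // `estimatedTotalRows` is an assumed, externally obtained estimate.
    static float chanceFor(long target, long estimatedTotalRows) {
        if (estimatedTotalRows <= 0) {
            throw new IllegalArgumentException("row estimate must be positive");
        }
        double chance = (double) target / (double) estimatedTotalRows;
        // Clamp to [0, 1]: asking for more rows than exist means keeping all.
        return (float) Math.min(1.0, Math.max(0.0, chance));
    }

    // Simulates an independent keep/drop decision per row and returns how
    // many rows were kept. This stands in for the filter; it is not HBase.
    static long simulateKept(long totalRows, float chance, long seed) {
        Random rng = new Random(seed);
        long kept = 0;
        for (long i = 0; i < totalRows; i++) {
            if (rng.nextFloat() < chance) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long target = 1_000_000L;
        long estimatedTotalRows = 100_000_000L; // made-up table size
        float chance = chanceFor(target, estimatedTotalRows);
        System.out.println("chance = " + chance);

        // Scaled-down simulation so it finishes quickly.
        long kept = simulateKept(1_000_000L, chance, 42L);
        System.out.println("kept " + kept + " of 1000000 simulated rows");
    }
}
```

The kept count only lands near the target, never exactly on it, which is why
a hard stop at 1 million still needs some form of counting.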

A variation would be to apply a RandomRowFilter as well as a KeyOnlyFilter
on the scan and then open a connection to the source table inside each
mapper to retrieve the values for each sampled row key.

If there is a simpler, more efficient way, I would be glad to hear about it.

Thank you,

/David
