hbase-user mailing list archives

From "Pamecha, Abhishek" <apame...@x.com>
Subject RE: Efficient way to sample from large HBase table.
Date Fri, 12 Oct 2012 18:06:45 GMT
Although I have no idea of your use case, I would be surprised if, when sampling, you wanted
to stop exactly at the 1M mark.

Here is one approach you might use:
Maybe if you store the total row count separately, say 90M, then you can randomly pick
1 in 90 rows in your MR job while doing a global scan. If your keys are uniformly distributed, you
can use mod-ranges and prefix filters to achieve that. This way, you don't have to instrument
your MR job to monitor its current progress.

A drawback with this approach, though, is that it requires a full scan. But you can use the basic
idea above and restrict the global scan to a more limited one, gaining efficiency at the cost of
some sampling randomness.
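
As a concrete sketch of the 1-in-90 pick, HBase's stock RandomRowFilter does the random
selection for you with no assumption about key distribution (the 90M count here is just
illustrative, not something I know about your table):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.RandomRowFilter;

public class OneInNinetyScan {
  // Each row passes independently with probability 1/90, so a global
  // scan over ~90M rows returns roughly 1M rows without any need to
  // track the job's progress.
  public static Scan build() {
    Scan scan = new Scan();
    scan.setFilter(new RandomRowFilter(1.0f / 90));
    scan.setCaching(1000); // batch rows per RPC so the full scan stays cheap
    return scan;
  }
}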

hth,
Abhishek

-----Original Message-----
From: David Koch [mailto:ogdude@googlemail.com] 
Sent: Friday, October 12, 2012 8:05 AM
To: user@hbase.apache.org
Subject: Efficient way to sample from large HBase table.

Hello,

I need to sample 1 million rows from a large HBase table. What is an efficient way of doing
this?

I thought about using a RandomRowFilter on a scan of the source table, in combination with a
Mapper, to get approximately the right number of rows.
However, since MapReduce counters cannot be reliably retrieved while a job is running, I would
need an external counter to keep track of the number of sampled records and stop the job at
1 million.
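
For reference, the job setup I have in mind looks roughly like this (just a sketch; HBase's
stock IdentityTableMapper stands in for my real mapper, and the table name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.RandomRowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class SampleJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "sample-1m-rows");
    job.setJarByClass(SampleJob.class);

    Scan scan = new Scan();
    scan.setCaching(1000);      // batch rows per RPC
    scan.setCacheBlocks(false); // keep a full scan out of the block cache
    // Pass ~1M of ~90M rows; the exact count still varies around 1M.
    scan.setFilter(new RandomRowFilter(1000000f / 90000000f));

    TableMapReduceUtil.initTableMapperJob("source_table", scan,
        IdentityTableMapper.class, ImmutableBytesWritable.class,
        Result.class, job);
    job.setNumReduceTasks(0); // map-only job
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}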

A variation would be to apply a RandomRowFilter as well as a KeyOnlyFilter to the scan, and
then open a connection to the source table inside each mapper to retrieve the values for each
sampled row key.
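
The mapper side of that variation might look like the following (again just a sketch; the
class name and table name are placeholders of mine):

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.RandomRowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

public class KeyOnlySampler extends TableMapper<ImmutableBytesWritable, Result> {
  // Scan-side setup: sample rows and strip values so only keys reach mappers.
  public static Scan buildScan() {
    Scan scan = new Scan();
    scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
        Arrays.<Filter>asList(new RandomRowFilter(1.0f / 90),
                              new KeyOnlyFilter())));
    return scan;
  }

  private HTable source;

  @Override
  protected void setup(Context ctx) throws IOException {
    source = new HTable(HBaseConfiguration.create(ctx.getConfiguration()),
        "source_table"); // placeholder table name
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result keyOnly, Context ctx)
      throws IOException, InterruptedException {
    // Re-fetch the full row for each sampled key from the source table.
    ctx.write(key, source.get(new Get(key.get())));
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    source.close();
  }
}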

If there is a simpler, more efficient way, I would be glad to hear about it.

Thank you,

/David
