hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: HBase Read Performance - Multiget vs TableInputFormat Job
Date Mon, 06 Feb 2012 16:21:07 GMT
On Sun, Feb 5, 2012 at 8:56 PM, Jon Bender <jonathan.bender@gmail.com> wrote:
> The two alternatives I am exploring are
>   1. Running a TableInputFormat MR job that filters for data added in the
>   past day (Scan on the internal timestamp range of the cells)

You'll touch all your data when you do this.

What percentage of total data is the 300k new rows?
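
For reference, the "past day" filter in option 1 comes down to a pair of millisecond timestamps. Below is a minimal sketch of computing that pair for the previous UTC day, assuming the cells carry the default epoch-millisecond timestamps; the class and method names are made up for illustration. In an HBase job the pair would typically be handed to Scan.setTimeRange(min, max), where the upper bound is exclusive.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper, not part of HBase.
final class PreviousDayRange {
    /**
     * Returns {minStamp, maxStamp} in epoch milliseconds covering the UTC day
     * before 'today'. minStamp is inclusive, maxStamp is exclusive.
     */
    static long[] previousDayMillis(LocalDate today) {
        long max = today.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        long min = today.minusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        return new long[] {min, max};
    }
}
```

Narrowing the time range only limits which cells are returned; the job still gets a map task per region of the table.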

>   2. Using a batched get (multiGet) with a list of the rows that were written
>   the previous day, most likely using a number of HBase client processes to
>   read this data out in parallel.

If you have the list of the 300k rows, this could work.  You could write
a mapreduce job that divides the 300k keys among the map tasks and, in
each mapper, runs a client that does a multiget (the client will sort
the gets by region for you).
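
The splitting step might look roughly like the sketch below: cut the key list into fixed-size batches, one batch per multiget call. The class name, method name and batch size are assumptions for illustration, not anything from HBase itself. Inside a mapper, each batch would be turned into a List<Get> and passed to HTable.get(List<Get>).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase.
final class MultigetBatches {
    /**
     * Splits rowKeys into consecutive batches of at most batchSize elements.
     * Order is preserved; the last batch may be smaller than the others.
     */
    static <T> List<List<T>> partition(List<T> rowKeys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < rowKeys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rowKeys.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(rowKeys.subList(start, end)));
        }
        return batches;
    }
}
```

With 300k keys and a batch size of 1,000 this yields 300 batches. The right batch size depends on row size and cluster load, so treat the number as a placeholder to tune.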

