hbase-user mailing list archives

From: Jon Bender <jonathan.ben...@gmail.com>
Subject: Re: HBase Read Performance - Multiget vs TableInputFormat Job
Date: Mon, 06 Feb 2012 16:58:38 GMT
Thanks for the responses!

>What percentage of total data is the 300k new rows?

A constantly shrinking percentage--we may retain upwards of 5 years of data
here, so running against the full table will get very expensive going
forward.  I think the second approach sounds best.
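For the record, option 1 would boil down to something like the sketch
below (assuming the 0.92-era MR API; the table name and per-row logic are
placeholders), and as you say it still touches the whole table rather than
just the day's writes:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

  public class DailyScanJob {

    static class DayMapper extends TableMapper<ImmutableBytesWritable, Result> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
        // per-row processing goes here
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "daily-scan");
      job.setJarByClass(DailyScanJob.class);

      long now = System.currentTimeMillis();
      Scan scan = new Scan();
      scan.setTimeRange(now - 24L * 60 * 60 * 1000, now);  // cells written in the last day
      scan.setCaching(500);        // fewer RPCs per mapper
      scan.setCacheBlocks(false);  // don't churn the block cache with a full scan

      TableMapReduceUtil.initTableMapperJob("mytable", scan,
          DayMapper.class, null, null, job);
      job.setOutputFormatClass(NullOutputFormat.class);
      job.setNumReduceTasks(0);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }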

>If you have the list of the 300k, this could work.  You could write a
>mapreduce job that divides the 300k into maps and in each mapper run a
>client to do a multiget (it'll sort the gets by regions for you).

When you say it'll sort the gets by regions for me, does that mean I'll
need to identify the regions before dividing up the maps?  Or can I just
deal with the fact that multiple maps might read from the same
regionserver?
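Concretely, I was picturing something like the per-mapper sketch below
(assuming each mapper receives a slice of the 300k row keys as its input,
e.g. via NLineInputFormat over a file of keys; the table name and batch
size are guesses):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class MultiGetMapper extends Mapper<LongWritable, Text, Text, Text> {
    private HTable table;
    private final List<Get> gets = new ArrayList<Get>();

    @Override
    protected void setup(Context ctx) throws IOException {
      table = new HTable(HBaseConfiguration.create(ctx.getConfiguration()), "mytable");
    }

    @Override
    protected void map(LongWritable offset, Text rowKey, Context ctx) throws IOException {
      gets.add(new Get(Bytes.toBytes(rowKey.toString())));
      if (gets.size() >= 1000) {  // batch size is a guess; worth tuning
        flush();
      }
    }

    // One round trip per batch; the client groups the gets by
    // region/regionserver internally before dispatching them.
    private void flush() throws IOException {
      Result[] results = table.get(gets);
      for (Result result : results) {
        // per-row processing goes here
      }
      gets.clear();
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
      if (!gets.isEmpty()) flush();
      table.close();
    }
  }

If that's right, the map splits wouldn't need to line up with region
boundaries at all.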

--Jon

On Mon, Feb 6, 2012 at 8:21 AM, Stack <stack@duboce.net> wrote:

> On Sun, Feb 5, 2012 at 8:56 PM, Jon Bender <jonathan.bender@gmail.com>
> wrote:
> > The two alternatives I am exploring are
> >
> >   1. Running a TableInputFormat MR job that filters for data added in the
> >   past day (Scan on the internal timestamp range of the cells)
>
> You'll touch all your data when you do this.
>
> What percentage of total data is the 300k new rows?
>
> >   2. Using a batched get (multiGet) with a list of the rows that were
> >   written the previous day, most likely using a number of HBase client
> >   processes to read this data out in parallel.
> >
> >
>
> If you have the list of the 300k, this could work.  You could write a
> mapreduce job that divides the 300k into maps and in each mapper run a
> client to do a multiget (it'll sort the gets by regions for you).
>
> St.Ack
>
