hbase-user mailing list archives

From "Gill" <gill....@colenity.com>
Subject Re: HBase Read Performance - Multiget vs TableInputFormat Job
Date Mon, 06 Feb 2012 10:10:48 GMT
1. TableInputFormat splits the work by region location for you; with a multiGet you would have to group the Gets by region yourself.
2. A Get is implemented internally as a single-row Scan, so a multiGet is effectively many small Scans (not sure).
3. With a Scan you can take advantage of batching and caching (setBatch/setCaching) to cut down on round trips.
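To illustrate point 1, here is a minimal sketch of grouping row keys by region start key, the way TableInputFormat does implicitly. It uses plain JDK collections only; in a real client you would fetch the region start keys from the cluster instead of hard-coding them, and the names here (RegionGrouper, groupByRegion) are hypothetical.

```java
import java.util.*;

public class RegionGrouper {
    // Group row keys by the region that holds them. A region holds a row
    // if its start key is the greatest start key <= the row key.
    public static Map<String, List<String>> groupByRegion(
            List<String> rowKeys, SortedSet<String> regionStartKeys) {
        TreeMap<String, List<String>> groups = new TreeMap<>();
        TreeSet<String> starts = new TreeSet<>(regionStartKeys);
        for (String key : rowKeys) {
            String start = starts.floor(key);
            if (start == null) start = starts.first(); // falls in first region
            groups.computeIfAbsent(start, s -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Pretend the table has three regions starting at "", "g", "p".
        SortedSet<String> starts = new TreeSet<>(Arrays.asList("", "g", "p"));
        List<String> rows = Arrays.asList("apple", "kiwi", "zebra", "grape");
        System.out.println(groupByRegion(rows, starts));
        // prints {=[apple], g=[kiwi, grape], p=[zebra]}
    }
}
```

Each group can then be sent as one batch to the server hosting that region, instead of scattering single Gets.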

------------------ Original ------------------
From: "Jon Bender" <jonathan.bender@gmail.com>
Date: Mon, Feb 6, 2012 12:56 PM
To: "user" <user@hbase.apache.org>

Subject:  HBase Read Performance - Multiget vs TableInputFormat Job


I've got a question about batch read performance in HBase.  I've got a
nightly job that extracts the HBase data added the previous day (currently
upwards of ~300k new rows).  The rows are spread fairly evenly over the key
range, so inevitably we will have to read from most, if not all, regions to
retrieve this data, and these reads will not be sequential across rows.

The two alternatives I am exploring are

   1. Running a TableInputFormat MR job that filters for data added in the
   past day (Scan on the internal timestamp range of the cells)
   2. Using a batched get (multiGet) with a list of the rows that were
   written the previous day, most likely using a number of HBase client
   processes to read this data out in parallel.
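As a sketch of option 2, the keys can be split into chunks and handed to a pool of workers, each issuing one batch per chunk. This is plain JDK code with the actual HBase round trip stubbed out; a real worker would call HTable.get(List&lt;Get&gt;) where fetchBatch is, and the names (ParallelBatchRead, fetchAll, fetchBatch) are hypothetical.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelBatchRead {
    // Split the day's row keys into chunks and fetch each chunk from a
    // worker thread; results are collected in submission order.
    public static List<String> fetchAll(List<String> rowKeys, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = (rowKeys.size() + workers - 1) / workers;
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int i = 0; i < rowKeys.size(); i += chunk) {
            List<String> slice =
                rowKeys.subList(i, Math.min(i + chunk, rowKeys.size()));
            futures.add(pool.submit(() -> fetchBatch(slice)));
        }
        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : futures) results.addAll(f.get());
        pool.shutdown();
        return results;
    }

    // Stand-in for one multiGet round trip against a region server.
    static List<String> fetchBatch(List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String k : keys) out.add("value-for-" + k);
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchAll(Arrays.asList("a", "b", "c", "d", "e"), 2));
        // prints [value-for-a, value-for-b, value-for-c, value-for-d, value-for-e]
    }
}
```

The win over a single sequential multiGet is that slow regions don't block the rest of the read; the cost is managing the partitioning and the worker pool yourself.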

Does anyone have any recommendations on which approach to take?  I haven't
used the new MultiGet operations so I figured I'd ask the pros before
diving in.
