hbase-user mailing list archives

From Dhaval Shah <prince_mithi...@yahoo.co.in>
Subject Re: Controlling TableMapReduceUtil table split points
Date Tue, 22 Jan 2013 23:10:43 GMT

Hi David. We successfully use the "logical" schema approach and have not seen issues yet.
Of course it all depends on the use case, and saying it would work for you because it works
for us would be naive. However, if it does work, it will make your life much easier, because
with a logical schema other problems become simpler: you can be sure that one map function
will process an entire row rather than a row going to multiple mappers, and if you are using
filters that restrict queries to only a small subset of the data, even setBatch won't be needed
for those use cases. I did run into issues where I did not use setBatch and my mappers ran
out of memory, but that was a simpler problem to solve. (By the way, if you are on CDH4, the
HBase export utility also does not use setBatch, and your mapper will run out of memory if
you have a large row. It's easy to pass that in as a config param, though, and this fix
is available in later releases of HBase.)
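For reference, wiring a batch size into the scan you hand to TableMapReduceUtil looks roughly like the sketch below. The table name, batch/caching values, and mapper class are placeholders, not anything from this thread:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class BatchedScanJob {

    // Placeholder mapper: with setBatch in effect, map() may be called
    // several times for the same row key, once per chunk of columns.
    static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result partial, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(rowKey, partial);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "batched-scan");
        job.setJarByClass(BatchedScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(500); // rows (or row chunks) fetched per RPC
        scan.setBatch(100);   // max cells per Result; wide rows arrive in chunks

        // "events" is a placeholder table name
        TableMapReduceUtil.initTableMapperJob(
            "events", scan, MyMapper.class,
            ImmutableBytesWritable.class, Result.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is job configuration and needs an HBase cluster plus the hbase-server/mapreduce jars on the classpath to actually run; the point is only where setBatch and setCaching plug in relative to initTableMapperJob.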


 From: David Koch <ogdude@googlemail.com>
To: user@hbase.apache.org 
Sent: Sunday, 6 January 2013 12:53 PM
Subject: Re: Controlling TableMapReduceUtil table split points
Hi Dhaval,

Good call on setBatch. I had forgotten about it. Just like changing the
schema, it would involve changing map(...) to reflect the fact that only
part of the user's data is returned in each call, but I would not have to
manipulate table splits.

The HBase book does suggest that it's bad practice to use the "logical"
schema of lumping all user data into a single row (*), but I'll do some
testing to see what works.

Thank you,


(*) Chapter 9, section "Tall-Narrow Versus Flat-Wide Tables", 3rd ed., page

On Sun, Jan 6, 2013 at 6:29 PM, Dhaval Shah <prince_mithibai@yahoo.co.in> wrote:

> Another option to avoid the timeout/OOME issues is to use scan.setBatch()
> so that the scanner functions normally for small rows but breaks up large
> rows into multiple Result objects, which you can then use in conjunction
> with scan.setCaching() to control how much data you get back per RPC.
> This approach would not need a change in your schema design and would
> ensure that only one mapper processes the entire row (but across multiple
> calls to the map function).
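To make the chunking semantics concrete, here is a toy sketch in plain Java (no HBase dependency, names illustrative only) of how a batch size splits one wide row into several Result-sized pieces:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {

    // Split one logical row of `totalCells` cells into chunks of at most
    // `batch` cells, mimicking what scan.setBatch(batch) does to a wide row:
    // the scanner returns one Result per chunk, all sharing the same row key.
    static List<Integer> chunkSizes(int totalCells, int batch) {
        List<Integer> chunks = new ArrayList<>();
        for (int remaining = totalCells; remaining > 0; remaining -= batch) {
            chunks.add(Math.min(batch, remaining));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A 250-cell row with setBatch(100) comes back as three Results:
        System.out.println(BatchSketch.chunkSizes(250, 100)); // [100, 100, 50]
        // A 10-cell row is unaffected and arrives as a single Result:
        System.out.println(BatchSketch.chunkSizes(10, 100));  // [10]
    }
}
```

Each chunk lands in a separate call to map(...), which is why the map function has to tolerate seeing the same row key more than once.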