hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: RowCounter example run time
Date Sun, 23 May 2010 15:33:47 GMT
> I don't have a set requirement.  Just trying to learn more about the system and 25 minutes
> seemed excessive.  I really have nothing to compare against and have no expectations; but,
> it takes about 900 seconds to run the count function in the shell.  My main goal is to figure
> out what reasonable times are given similar setups or just to have a general idea of what's
> acceptable so that I can make sure that everything is configured properly.

The shell uses scanner caching, from bin/HBase.rb:

      # We can safely set scanner caching with the first key only filter

So you can see how MR's overhead + lack of scanner caching = slow
count :P  So do configure the hbase-site.xml you provide to your job
with adequate caching (hbase.client.scanner.caching). This is a
client-side config, no need to restart HBase.
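
To put rough numbers on why caching matters, here is a minimal sketch that counts how many scanner round trips a full-table count needs at different caching values. It assumes each next() call returns up to `caching` rows and ignores the extra calls for opening and closing the scanner. The class name, method name, and the 1,000,000-row table size are hypothetical, chosen only for illustration; none of this is HBase API.

```java
// Hypothetical helper, not part of HBase: estimates scanner round trips for a full scan.
final class ScanRoundTrips {

    /**
     * Number of next() calls needed to fetch rowCount rows when each call
     * returns up to `caching` rows (scanner open/close calls are ignored).
     */
    static long roundTrips(long rowCount, int caching) {
        if (rowCount < 0 || caching < 1) {
            throw new IllegalArgumentException("rowCount must be >= 0 and caching >= 1");
        }
        return (rowCount + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // assumed table size, for illustration only
        for (int caching : new int[] {1, 10, 100, 1000}) {
            System.out.printf("caching=%d -> %d round trips%n",
                    caching, roundTrips(rows, caching));
        }
    }
}
```

With caching at 1 every row costs its own round trip, so the per-call latency is multiplied by the row count; raising the value divides the number of calls accordingly.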

> I'm not sure how many regions there are per table.  My guess is whatever the default
> is since this isn't an option I've tried to change.  However, I will look into it more and
> update the thread.

HBase is an auto-sharded database where the region is the basic unit
of load distribution. Every table starts with 1 region and then grows
organically as you insert data (the regions are split in two). Check
your master's web UI, it will give you the region count. Click on that
table's name, it'll show you all the regions for that table.

> Again, my guess is that hbase.client.scanner.caching is 1 as you have mentioned.  When
> calculating the size of a row, is this just the size of the data stored in the various columns
> or do I need to factor in overhead also?  Do you have a reference or any guidance on the
> optimal setting for the hbase.client.scanner.caching given the size of a typical row?  In
> my case, I have about 8 rows, each storing a decimal value.  I haven't checked, but I'm assuming
> these are being stored as doubles.

A cell in HBase is stored like this: row key + family key + qualifier
+ timestamp + actual data (all byte arrays). The size of your row is
the sum of the sizes of all its cells. The optimal setting usually
comes from trying to keep each fetch (caching x row size) under 12MB
in order to not OOME the region servers. This is actually enforced in
HBase's next major version.
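
The arithmetic above can be sketched as follows. This is only an estimate under stated assumptions: the timestamp is taken as an 8-byte long, 12MB is read as 12 * 1024 * 1024 bytes, and any extra per-cell framing HBase adds on top of the listed parts is ignored. The class, the method names, and the sample row key, family, and qualifier names are all hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical estimator, not part of HBase: sizes a row from its cells
// and derives the largest caching value that fits a per-fetch byte budget.
final class RowSizeEstimate {
    static final int TIMESTAMP_BYTES = 8;               // assumes a long timestamp
    static final long BUDGET_BYTES = 12L * 1024 * 1024; // assumes 12MB means 12 MiB

    /** Cell size = row key + family + qualifier + timestamp + value. */
    static long cellSize(byte[] row, byte[] family, byte[] qualifier, byte[] value) {
        return (long) row.length + family.length + qualifier.length
                + TIMESTAMP_BYTES + value.length;
    }

    /** Largest caching value (at least 1) keeping caching * rowSize within the budget. */
    static long maxCaching(long rowSizeBytes, long budgetBytes) {
        if (rowSizeBytes <= 0) {
            throw new IllegalArgumentException("rowSizeBytes must be positive");
        }
        return Math.max(1L, budgetBytes / rowSizeBytes);
    }

    public static void main(String[] args) {
        byte[] row = "row-000001".getBytes(StandardCharsets.UTF_8); // assumed key format
        byte[] family = "d".getBytes(StandardCharsets.UTF_8);       // assumed family name
        byte[] value = new byte[Double.BYTES];                      // one double per cell

        long rowSize = 0;
        for (int i = 0; i < 8; i++) { // eight columns, each holding one double
            byte[] qualifier = ("c" + i).getBytes(StandardCharsets.UTF_8);
            rowSize += cellSize(row, family, qualifier, value);
        }
        System.out.println("estimated row size in bytes: " + rowSize);
        System.out.println("upper bound on caching: " + maxCaching(rowSize, BUDGET_BYTES));
    }
}
```

For rows this small the bound is far larger than any caching value you would likely pick, so the budget only starts to bite with wide rows or large values.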

> Thanks!