hbase-user mailing list archives

From James Birchfield <jbirchfi...@stumbleupon.com>
Subject Re: HBase Table Row Count Optimization - A Solicitation For Help
Date Fri, 20 Sep 2013 22:50:27 GMT
Hadoop 2.0.0-cdh4.3.1

HBase 0.94.6-cdh4.3.1

110 servers, 0 dead, 238.2364 average load

Some other info, not sure if it helps or not.

Configured Capacity: 1295277834158080 (1.15 PB)
Present Capacity: 1224692609430678 (1.09 PB)
DFS Remaining: 624376503857152 (567.87 TB)
DFS Used: 600316105573526 (545.98 TB)
DFS Used%: 49.02%
Under replicated blocks: 0
Blocks with corrupt replicas: 1
Missing blocks: 0

It is hitting a production cluster, but I am not really sure how to calculate the load placed
on the cluster.

On Sep 20, 2013, at 3:19 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> How many nodes do you have in your cluster ?
> When counting rows, what other load would be placed on the cluster ?
> What is the HBase version you're currently using / planning to use ?
> Thanks
> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
> jbirchfield@stumbleupon.com> wrote:
>>        After reading the documentation and scouring the mailing list
>> archives, I understand there is no real support for fast row counting in
>> HBase unless you build some sort of tracking logic into your code.  In our
>> case, we do not have such logic, and have massive amounts of data already
>> persisted.  I am running into the issue of very long execution of the
>> RowCounter MapReduce job against very large tables (by our estimate, many
>> hold billions of rows).  I understand why this issue exists and am slowly
>> accepting it, but I am hoping I can solicit some possible ideas to help
>> speed things up a little.
>>        My current task is to provide total row counts on about 600
>> tables, some extremely large, some not so much.  Currently, I have a
>> process that executes the MapReduce job in-process like so:
>>        Job job = RowCounter.createSubmittableJob(
>>                ConfigManager.getConfiguration(), new String[]{tableName});
>>        boolean waitForCompletion = job.waitForCompletion(true);
>>        Counters counters = job.getCounters();
>>        Counter rowCounter =
>>                counters.findCounter(hbaseadminconnection.Counters.ROWS);
>>        return rowCounter.getValue();
>>        At the moment, each MapReduce job is executed in serial order, so
>> counting one table at a time.  For the current implementation of this whole
>> process, as it stands right now, my rough timing calculations indicate that
>> fully counting all the rows of these 600 tables will take anywhere between
>> 11 and 22 days.  This is not what I consider a desirable timeframe.
>>        I have considered three alternative approaches to speed things up.
>>        First, since the application is not heavily CPU bound, I could use
>> a ThreadPool and execute multiple MapReduce jobs at the same time looking
>> at different tables.  I have never done this, so I am unsure if this would
>> cause any unanticipated side effects.
>>        Second, I could distribute the processes.  I could find as many
>> machines as can successfully talk to the desired cluster, give
>> them a subset of tables to work on, and then combine the results post
>> process.
>>        Third, I could combine both the above approaches and run a
>> distributed set of multithreaded processes to execute the MapReduce jobs in
>> parallel.
>>        Although it seems to have been asked and answered many times, I
>> will ask once again.  Without the need to change our current configurations
>> or restart the clusters, is there a faster approach to obtain row counts?
>> FYI, my cache size for the Scan is set to 1000.  I have experimented with
>> different numbers, but nothing made a noticeable difference.  Any advice or
>> feedback would be greatly appreciated!
>> Thanks,
>> Birch
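
For the first option in the quoted message (a thread pool that counts several tables at the same time), a minimal sketch might look like the following. Everything here is illustrative: ParallelTableCounter, countAll and maxConcurrentJobs are hypothetical names, and the counter function is a stand-in for the RowCounter job submission shown in the quoted message. The main method uses a dummy counter so the sketch runs without a cluster; whether the cluster tolerates several concurrent counting jobs is not something this sketch can answer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.ToLongFunction;

public class ParallelTableCounter {

    /**
     * Counts every table using at most maxConcurrentJobs worker threads.
     * The counter argument is a placeholder (assumption): in the real setup it
     * would submit the RowCounter job for one table and return its ROWS counter.
     */
    public static Map<String, Long> countAll(List<String> tables,
                                             ToLongFunction<String> counter,
                                             int maxConcurrentJobs) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentJobs);
        try {
            // Submit one task per table; the fixed pool caps how many run at once.
            Map<String, CompletableFuture<Long>> pending = new LinkedHashMap<>();
            for (String table : tables) {
                CompletableFuture<Long> future =
                        CompletableFuture.supplyAsync(() -> counter.applyAsLong(table), pool);
                pending.put(table, future);
            }
            // Collect results in submission order; join() blocks until each task finishes.
            Map<String, Long> counts = new LinkedHashMap<>();
            for (Map.Entry<String, CompletableFuture<Long>> entry : pending.entrySet()) {
                counts.put(entry.getKey(), entry.getValue().join());
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Dummy counter (table name length) so the sketch runs without HBase.
        Map<String, Long> counts =
                countAll(Arrays.asList("t1", "table2"), name -> (long) name.length(), 2);
        System.out.println(counts);
    }
}
```

The fixed pool size is the knob that matters here: it bounds how many counting jobs hit the cluster at once, so it can be started low and raised while watching the load.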
