From James Birchfield <jbirchfi...@stumbleupon.com>
Subject Re: HBase Table Row Count Optimization - A Solicitation For Help
Date Sat, 21 Sep 2013 00:07:46 GMT
I agree with your first statement. I am in no way saying HBase is being used properly as a store.
 I am only saying my task is to determine the row counts as accurately as possible for the
data and setup we currently have.

I set the scan caching to 1000.  I tried 10000, but did not see much of a performance increase.

I will look further into coprocessors.  Since I am relatively new to the technology, can someone
provide a quick answer to this?  Will using a coprocessor require me to change and restart
our cluster?  I am assuming it is possibly a configuration thing?  If so, I will have to see
if that is an option.  If the answer is no, great.  If yes, and it is an option for me, I
will definitely take a look at this approach.

On Sep 20, 2013, at 4:56 PM, lars hofhansl <larsh@apache.org> wrote:

> Hi James,
> do you need that many tables? "Table" in HBase should have been called "KeySpace" instead.
600 is a lot.
> But anyway... Did you enable scanner caching for your M/R job? (If you didn't, every next()
will be a roundtrip to the RegionServer and you end up measuring your network's RTT.)
> Are you IO bound?
> Lastly, instead of doing it as M/R (which has to bring all the data back to the mapper
just to count the returned rows), you could use a coprocessor, which does the counting on the
server (or use Phoenix, search back in the archives for an example that James Taylor gave
for row counting).
> -- Lars
> From: James Birchfield <jbirchfield@stumbleupon.com>
> To: user@hbase.apache.org 
> Sent: Friday, September 20, 2013 2:47 PM
> Subject: HBase Table Row Count Optimization - A Solicitation For Help
>     After reading the documentation and scouring the mailing list archives, I understand
there is no real support for fast row counting in HBase unless you build some sort of tracking
logic into your code.  In our case, we do not have such logic, and have massive amounts of
data already persisted.  I am running into the issue of very long execution of the RowCounter
MapReduce job against very large tables (multi-billion for many is our estimate).  I understand
why this issue exists and am slowly accepting it, but I am hoping I can solicit some possible
ideas to help speed things up a little.
>     My current task is to provide total row counts on about 600 tables, some extremely
large, some not so much.  Currently, I have a process that executes the MapReduce job in-process
like so:
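>             // Build the stock RowCounter MapReduce job for a single table.
>             Job job = RowCounter.createSubmittableJob(
>                     ConfigManager.getConfiguration(), new String[]{tableName});
>             // Block until the job finishes; 'true' prints progress as it runs.
>             boolean waitForCompletion = job.waitForCompletion(true);
>             // Read the total from the job's ROWS counter.
>             Counters counters = job.getCounters();
>             Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
>             return rowCounter.getValue();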
>     At the moment, each MapReduce job is executed in serial order, so counting one table
at a time.  For the current implementation of this whole process, as it stands right now,
my rough timing calculations indicate that fully counting all the rows of these 600 tables
will take anywhere between 11 and 22 days.  This is not what I consider a desirable timeframe.
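For a sense of scale, the 11 to 22 day estimate over 600 tables can be turned into an average per-table cost and an idealized parallel bound. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and the parallel figure assumes the cluster has enough spare capacity that concurrent jobs do not slow each other down, which may well not hold.

```java
class SerialEstimate {
    // Average minutes per table implied by a serial run lasting totalDays.
    static double minutesPerTable(double totalDays, int tableCount) {
        return totalDays * 24 * 60 / tableCount;
    }

    // Idealized wall-clock days if `concurrency` jobs ran at once with
    // no contention (an assumption, not a measurement).
    static double idealDays(double totalDays, int concurrency) {
        return totalDays / concurrency;
    }

    public static void main(String[] args) {
        System.out.printf("per table: %.1f to %.1f minutes%n",
                minutesPerTable(11, 600), minutesPerTable(22, 600));
        System.out.printf("4 jobs at once: %.2f to %.2f days%n",
                idealDays(11, 4), idealDays(22, 4));
    }
}
```

The average hides the fact that a few very large tables probably dominate the total, so the real gain from parallelism depends on how those are scheduled.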
>     I have considered three alternative approaches to speed things up.
>     First, since the application is not heavily CPU bound, I could use a ThreadPool and
execute multiple MapReduce jobs at the same time looking at different tables.  I have never
done this, so I am unsure if this would cause any unanticipated side effects.  
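A minimal sketch of the thread pool idea, using only java.util.concurrent and assuming Java 8 or later. The class name, `countAll`, and the `countOne` hook are hypothetical; in practice `countOne` would wrap the RowCounter job submission shown earlier. It is not tested against a real cluster, and whether concurrent MapReduce jobs actually help depends on how many free task slots the cluster has.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

class ParallelTableCounter {
    // Counts every table via countOne, running at most maxConcurrent counts at once.
    // countOne is a hypothetical hook standing in for "run RowCounter on this table".
    static Map<String, Long> countAll(List<String> tables, int maxConcurrent,
                                      ToLongFunction<String> countOne) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            // Submit one task per table; the pool size caps how many run together.
            Map<String, Future<Long>> pending = new LinkedHashMap<>();
            for (String table : tables) {
                Callable<Long> task = () -> countOne.applyAsLong(table);
                pending.put(table, pool.submit(task));
            }
            // Collect results in submission order, blocking on each future.
            Map<String, Long> counts = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Long>> e : pending.entrySet()) {
                counts.put(e.getKey(), e.getValue().get());
            }
            return counts;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while counting", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("a count task failed", ex.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Keeping the pool size small and configurable makes it easy to start at two or three concurrent jobs and watch the cluster before going higher.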
>     Second, I could distribute the processes.  I could find as many machines as can
successfully talk to the desired cluster properly, give them a subset of tables to work on,
and then combine the results post process.
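The split-and-combine step of the distributed option could look roughly like the sketch below. The class and method names are hypothetical, and the round-robin split is just one simple choice: it balances the number of tables per machine, not the amount of data, so a machine that draws several of the largest tables would still finish last.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TableSharding {
    // Deals the tables out round-robin into `workers` sublists, one per machine.
    static List<List<String>> partition(List<String> tables, int workers) {
        if (workers < 1) {
            throw new IllegalArgumentException("workers must be >= 1");
        }
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            shards.add(new ArrayList<>());
        }
        for (int i = 0; i < tables.size(); i++) {
            shards.get(i % workers).add(tables.get(i));
        }
        return shards;
    }

    // Combines the per-machine results (table name -> row count) into one map.
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> all = new LinkedHashMap<>();
        for (Map<String, Long> partial : partials) {
            all.putAll(partial);
        }
        return all;
    }
}
```

Each machine would receive one sublist, count its tables, and report back a map that `merge` folds into the final result.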
>     Third, I could combine both the above approaches and run a distributed set of multithreaded
processes to execute the MapReduce jobs in parallel.
>     Although it seems to have been asked and answered many times, I will ask once again.
 Without the need to change our current configurations or restart the clusters, is there a
faster approach to obtain row counts?  FYI, my cache size for the Scan is set to 1000.  I
have experimented with different numbers, but nothing made a noticeable difference.  Any advice
or feedback would be greatly appreciated!
> Thanks,
> Birch

