hbase-user mailing list archives

From James Birchfield <jbirchfi...@stumbleupon.com>
Subject HBase Table Row Count Optimization - A Solicitation For Help
Date Fri, 20 Sep 2013 21:47:03 GMT
	After reading the documentation and scouring the mailing list archives, I understand there
is no real support for fast row counting in HBase unless you build some sort of tracking logic
into your code.  In our case, we do not have such logic, and have massive amounts of data
already persisted.  I am running into the issue of very long execution of the RowCounter MapReduce
job against very large tables (we estimate that many hold multiple billions of rows).  I understand why
this issue exists and am slowly accepting it, but I am hoping I can solicit some possible
ideas to help speed things up a little.
	
	My current task is to provide total row counts on about 600 tables, some extremely large,
some not so much.  Currently, I have a process that executes the MapReduce job in-process, like
so:
	
			Job job = RowCounter.createSubmittableJob(
					ConfigManager.getConfiguration(), new String[]{tableName});
			boolean waitForCompletion = job.waitForCompletion(true);
			Counters counters = job.getCounters();
			Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
			return rowCounter.getValue();
			
	At the moment, the MapReduce jobs are executed serially, counting one table at a
time.  With the current implementation, my rough timing calculations indicate that fully
counting all the rows of these 600 tables will take between 11 and 22 days.  This is not
what I consider a desirable timeframe.

	I have considered three alternative approaches to speed things up.

	First, since the application is not heavily CPU bound, I could use a ThreadPool and execute
multiple MapReduce jobs at the same time looking at different tables.  I have never done this,
so I am unsure if this would cause any unanticipated side effects.  
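
	A minimal sketch of what this first option might look like, assuming a fixed-size
pool and a per-table counting function.  The class name ParallelRowCounts and the countOne
parameter are hypothetical; countOne stands in for the RowCounter job call shown above, and
the sketch says nothing about how many concurrent jobs the cluster can actually handle.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

public class ParallelRowCounts {

    /**
     * Counts rows for every table, running at most `threads` counts at once.
     * `countOne` is a hypothetical stand-in for submitting the RowCounter job
     * for one table and reading its ROWS counter.
     */
    public static Map<String, Long> countAll(List<String> tables, int threads,
                                             ToLongFunction<String> countOne)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per table; the pool limits how many run concurrently.
            Map<String, Future<Long>> futures = new LinkedHashMap<>();
            for (String table : tables) {
                Callable<Long> task = () -> countOne.applyAsLong(table);
                futures.put(table, pool.submit(task));
            }
            // Collect results in submission order; get() blocks until each task is done.
            Map<String, Long> counts = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Long>> e : futures.entrySet()) {
                counts.put(e.getKey(), e.getValue().get());
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }
}
```

	Each job still blocks its submitting thread in waitForCompletion, so the pool size
here only caps how many jobs are in flight at once; it does not add capacity to the cluster.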

	Second, I could distribute the processes.  I could find as many machines as can
talk to the desired cluster, give each of them a subset of tables to work on, and then combine
the results afterwards.
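
	A rough sketch of the split and combine steps for this second option.  The class
name TableSharding and the round-robin assignment are assumptions made for illustration;
since table sizes vary widely, assigning tables by estimated size would probably balance
the work better.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TableSharding {

    /** Assigns tables to `workers` subsets round-robin (table i goes to subset i % workers). */
    public static List<List<String>> split(List<String> tables, int workers) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            shards.add(new ArrayList<>());
        }
        for (int i = 0; i < tables.size(); i++) {
            shards.get(i % workers).add(tables.get(i));
        }
        return shards;
    }

    /** Combines the per-machine results (table name to row count) into one sorted map. */
    public static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> all = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            all.putAll(partial);
        }
        return all;
    }
}
```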

	Third, I could combine both of the above approaches and run a distributed set of multithreaded
processes to execute the MapReduce jobs in parallel.

	Although it seems to have been asked and answered many times, I will ask once again.  Without
the need to change our current configurations or restart the clusters, is there a faster approach
to obtain row counts?  FYI, my cache size for the Scan is set to 1000.  I have experimented
with different numbers, but nothing made a noticeable difference.  Any advice or feedback
would be greatly appreciated!

Thanks,
Birch