hbase-user mailing list archives

From Dmitry Chechik <dmi...@tellapart.com>
Subject HBase mapreduce and ResourceEstimator
Date Wed, 24 Mar 2010 20:26:36 GMT
Hi all,

We have an issue that occasionally crops up in the following scenario:
1. We have a fairly small HBase table (say 400 MB).
2. We have a larger set of input from HDFS (say 1 GB).

We run a MapReduce job that joins these two inputs (i.e., some of the
mappers read from HDFS, and some read from HBase).
The mappers that read from HBase all have TableSplits, which return 0 for
getLength().
The HDFS mappers have a non-zero getLength(), which is roughly the file size
of the HDFS input.
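As a rough illustration of that asymmetry, here is a minimal sketch using simplified stand-ins (these are not the real Hadoop/HBase classes; at the time of this message, TableSplit.getLength() simply returned 0 while FileSplit reported its actual byte range):

```java
// Simplified stand-ins for the two split types discussed above.
interface Split {
    long getLength();
}

// Mimics org.apache.hadoop.hbase.mapreduce.TableSplit of this era,
// whose getLength() returned 0 regardless of region size.
class TableSplitSketch implements Split {
    public long getLength() {
        return 0L;
    }
}

// Mimics a FileSplit over HDFS input, which reports the byte range it covers.
class FileSplitSketch implements Split {
    private final long length;

    FileSplitSketch(long length) {
        this.length = length;
    }

    public long getLength() {
        return length;
    }
}

public class SplitLengths {
    public static void main(String[] args) {
        Split hbase = new TableSplitSketch();
        Split hdfs = new FileSplitSketch(64L * 1024 * 1024); // one 64 MB block
        System.out.println(hbase.getLength()); // 0
        System.out.println(hdfs.getLength());  // 67108864
    }
}
```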

Because of this, the total input size of the job is roughly the size of the
HDFS input. Since the HBase table is small, the HBase mappers often finish
first. In that case, the ResourceEstimator sees a completedMapsInputSize
near 0 (actually equal to the number of completed HBase map tasks, which is
on the order of tens for us), while completedMapsOutputSize is fairly large
(it's the actual bytes output by the HBase mappers). So the
ResourceEstimator's estimated total map output size is roughly:

inputSize * completedMapsOutputSize * 2 / completedMapsInputSize
= (1G) * (400M * 2) / (32)

So the estimated total map output size is enormous, and the HDFS map tasks
wind up pending because the cluster doesn't appear to have enough space for
them.
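Plugging the hypothetical numbers from above into that formula shows how badly the estimate blows up. This is just a worked example of the quoted expression, not the actual Hadoop implementation:

```java
// Worked example of the ResourceEstimator blow-up described above,
// using the hypothetical figures from this message: 1 GB of HDFS input,
// ~400 MB of map output from the completed HBase mappers, and a
// completedMapsInputSize of ~32 (one unit per zero-length HBase split).
public class EstimateBlowup {
    static long estimateTotalMapOutput(long inputSize,
                                       long completedMapsOutputSize,
                                       long completedMapsInputSize) {
        // Same shape as the formula quoted above, including the 2x factor.
        return inputSize * completedMapsOutputSize * 2 / completedMapsInputSize;
    }

    public static void main(String[] args) {
        long estimate = estimateTotalMapOutput(
                1_000_000_000L,  // inputSize: ~1 GB of HDFS input
                400_000_000L,    // completedMapsOutputSize: ~400 MB
                32L);            // completedMapsInputSize: ~number of HBase tasks
        System.out.println(estimate); // 25000000000000000, i.e. ~25 PB
    }
}
```

A 25-petabyte estimate from a job whose real output is on the order of a gigabyte is why the scheduler refuses to place the remaining map tasks.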

Has anyone else run into this? One solution would be for TableSplit to
return a reasonable estimate of the size of each split instead of 0, but it
looks like that isn't possible right now.
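In case it helps the discussion, here is a hedged sketch of what such a split might look like. The region-size figure would have to come from somewhere (e.g. store-file sizes reported by the region servers); here it's just a constructor argument, since the TableSplit of this era exposes no such API:

```java
// Hypothetical sketch, not an existing HBase class: a table split that
// carries an estimated byte size so getLength() can return something
// non-zero for the ResourceEstimator to work with.
class SizedTableSplitSketch {
    private final byte[] startRow;
    private final byte[] endRow;
    private final long estimatedLength; // e.g. store-file bytes for the region

    SizedTableSplitSketch(byte[] startRow, byte[] endRow, long estimatedLength) {
        this.startRow = startRow;
        this.endRow = endRow;
        this.estimatedLength = estimatedLength;
    }

    // Returning the estimate here would give HBase-backed map tasks a
    // sane completedMapsInputSize instead of ~0.
    public long getLength() {
        return estimatedLength;
    }
}
```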

- Dmitry
