hbase-user mailing list archives

From "calugaru.cristi" <calugaru.cri...@gmail.com>
Subject HBase bulk load usage
Date Fri, 29 Nov 2013 09:33:02 GMT
I am trying to import some HDFS data into an already existing HBase table. The
table was created with two column families and all the default settings HBase
applies when creating a new table. It is already filled with a large volume of
data, and it has 98 online regions. The row keys it has are of the form
(simplified version):

Example of key:

The data I want to import is on HDFS, and I am using a MapReduce job to read
it. My mapper emits one Put object for each line read from the HDFS files. The
existing data has keys which all start with "XX181113". The job is configured
with:

HFileOutputFormat.configureIncrementalLoad(job, hTable)
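
For completeness, this is roughly how the job is wired up. It is a simplified
sketch, not my exact code: the mapper logic, table name, column family, and
paths are placeholders, and it assumes the 0.94-era API.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadPrepare {

      // Emits one Put per input line; the row key is taken from the line.
      static class LineToPutMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // Placeholder key logic: first tab-separated field is the row key.
          byte[] rowKey = Bytes.toBytes(line.toString().split("\t")[0]);
          Put put = new Put(rowKey);
          // Placeholder column family and qualifier.
          put.add(Bytes.toBytes("cf1"), Bytes.toBytes("q"),
                  Bytes.toBytes(line.toString()));
          context.write(new ImmutableBytesWritable(rowKey), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "bulk-load-prepare");
        job.setJarByClass(BulkLoadPrepare.class);
        job.setMapperClass(LineToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        HTable hTable = new HTable(conf, "my_table"); // placeholder table name
        // Sets the output format, the TotalOrderPartitioner partition file,
        // the reducer, and the number of reducers (= number of regions).
        HFileOutputFormat.configureIncrementalLoad(job, hTable);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }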
Once I start the process, I see it configured with 98 reducers (equal to the
number of online regions the table has), but the issue is that 4 reducers got
100% of the data split among them, while the rest did nothing. As a result, I
see only 4 output folders, which are very large. Do these files correspond to
4 new regions which I can then import into the table? And if so, why only 4,
when 98 reducers were created? Reading the HBase docs:

"In order to function efficiently, HFileOutputFormat must be configured such
that each output HFile fits within a single region. In order to do this,
jobs whose output will be bulk loaded into HBase use Hadoop's
TotalOrderPartitioner class to partition the map output into disjoint ranges
of the key space, corresponding to the key ranges of the regions in the

confused me even more as to why I get this behaviour.
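
My understanding of the mechanism described there, pieced together from the
docs, is roughly the following (an illustrative sketch, not the actual
HBase/Hadoop code): configureIncrementalLoad reads the start keys of the
table's regions, writes them to a partition file, and TotalOrderPartitioner
then routes each row key to the reducer whose region range contains it.

    import java.util.Arrays;

    public class RegionPartitionSketch {

      // Region start keys, sorted ascending; the first region starts at the
      // empty key. Captured from the table at job-setup time. These keys are
      // made up for illustration.
      static final String[] REGION_START_KEYS = { "", "XX10", "XX15", "XX20" };

      // A row key goes to the last region whose start key is <= the row key.
      // This mirrors the effect of the partition file that
      // configureIncrementalLoad hands to TotalOrderPartitioner.
      static int partitionFor(String rowKey) {
        int idx = Arrays.binarySearch(REGION_START_KEYS, rowKey);
        return idx >= 0 ? idx : -idx - 2; // insertion point minus one
      }

      public static void main(String[] args) {
        // Keys that all share one prefix fall into the same region range,
        // so only the matching reducers receive any data.
        System.out.println(partitionFor("XX181113_a")); // -> 2
        System.out.println(partitionFor("XX181113_b")); // -> 2
        System.out.println(partitionFor("AA000000_c")); // -> 0
      }
    }

If that is the right picture, then the number of reducers that actually
receive data would depend only on how many region ranges the incoming keys
fall into, regardless of how many reducers are created.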


View this message in context: http://apache-hbase.679495.n3.nabble.com/HBase-bulk-load-usage-tp4053231.html
Sent from the HBase User mailing list archive at Nabble.com.
