hbase-user mailing list archives

From Rob <robby.verkuy...@gmail.com>
Subject Explosion in datasize using HBase as a MR sink
Date Wed, 29 May 2013 15:28:24 GMT

We're moving from ingesting our data via the Thrift API to inserting our records via a MapReduce
job. For the MR job I've used the exact same job setup from "HBase: The Definitive Guide", page 309.
We're running CDH4.0.1 with HBase 0.92.1.
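For reference, that job setup looks roughly like the sketch below. This is a minimal reconstruction of the book's TableMapper/TableReducer template, not my actual code: the class names (ParseJob, ParseMapper), the "pb" qualifier, and the parse() helper are placeholders for illustration.

```java
// Sketch of an HBase-to-HBase MR job, modeled on the template from
// "HBase: The Definitive Guide" (p. 309). ParseMapper and parse() are
// hypothetical stand-ins for the real protobuf-encoding logic.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class ParseJob {

  static class ParseMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context)
        throws java.io.IOException, InterruptedException {
      // Placeholder: encode the raw row as a protobuf and emit one Put per row.
      byte[] parsed = parse(columns);
      Put put = new Put(row.get());
      put.add(Bytes.toBytes("data"), Bytes.toBytes("pb"), parsed);
      context.write(row, put);
    }

    private byte[] parse(Result columns) {
      return columns.value();  // stand-in for the real protobuf encoding
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "Parse Table1 into Table2");
    job.setJarByClass(ParseJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // batch scanner RPCs
    scan.setCacheBlocks(false);  // don't fill the block cache from a full scan

    TableMapReduceUtil.initTableMapperJob("Table1", scan, ParseMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    TableMapReduceUtil.initTableReducerJob("Table2", IdentityTableReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

(This only compiles and runs against an HBase 0.92-era classpath and a live cluster, so it is shown as a sketch rather than a tested program.)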

We are parsing data from HBase Table1 into HBase Table2: Table1 holds the unparsed data, and Table2
holds the parsed records stored as protobufs. This works fine via the Thrift API (in Python),
but it doesn't scale, so we want to move to an MR job. Both T1 and T2 contain 100M records.
Current stats, using 2 GB region sizes:

Table1: 130 regions, taking up 134 GB of space
Table2: 28 regions, taking up 39.3 GB of space

The problem arises when I take a sample of 6M records from Table1 and MR those into a new
Table2.1. Those 6M records suddenly get spread over 178 regions taking up 217.5 GB of disk.

Both T2 and T2.1 have the following simple schema:
	create 'Table2', {NAME => 'data', COMPRESSION => 'SNAPPY', VERSIONS => 1}

I can retrieve and parse records from both T2 and T2.1, so the data is there and validated,
but I can't figure out why the explosion in size is happening. Triggering a major
compaction changes little (about 2 GB in total size). My understanding is that Snappy compression
is applied as soon as the RegionServers write store files/HFiles, so it should already be in effect.
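For what it's worth, these are the kinds of commands I'm using to inspect the tables (paths assume the default /hbase root dir; they need a live cluster, so treat them as a sketch):

```shell
# Compare on-disk size of the two output tables in HDFS.
hadoop fs -du /hbase/Table2 /hbase/Table2.1

# Confirm the compression setting actually took on the new table.
echo "describe 'Table2.1'" | hbase shell

# Force a major compaction and re-check the sizes afterwards.
echo "major_compact 'Table2.1'" | hbase shell
```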

Any thoughts?