hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Explosion in datasize using HBase as a MR sink
Date Thu, 30 May 2013 02:51:19 GMT
On Wed, May 29, 2013 at 8:28 AM, Rob <robby.verkuylen@gmail.com> wrote:

> We're moving from ingesting our data via the Thrift API to inserting our
> records via a MapReduce job. For the MR job I've used the exact same job
> setup from HBase: The Definitive Guide, page 309. We're running CDH4.0.1,
> HBase 0.92.1. We are parsing data from HBase Table1 into HBase Table2;
> Table1 is unparsed data, Table2 is parsed and stored as a protobuf. This
> works fine via the Thrift API (in Python), but that doesn't scale, so we
> want to move to an MR job. Both T1 and T2 contain 100M records. Current
> stats, using 2 GB region sizes:
> Table1: 130 regions, taking up 134 GB of space
> Table2: 28 regions, taking up 39.3 GB of space
> The problem arises when I take a sample of 6M records from Table1 and MR
> them into a new Table2.1. Those 6M records suddenly get spread over 178
> regions taking up 217.5 GB of disk space.
> Both T2 and T2.1 have the following simple schema:
>         create 'Table2', {NAME => 'data', COMPRESSION => 'SNAPPY',
> VERSIONS => 1}
> I can retrieve and parse records from both T2 and T2.1, so the data is
> there and validated, but I can't seem to figure out why the explosion in
> size is happening. Triggering a major compaction changes little (about
> 2 GB of the total size). I understand that Snappy compression is applied
> when RegionServers create store files and hfiles, so compression should
> take effect immediately.

Triggering a major compaction does not alter the overall 217.5GB size?

You have speculative execution turned on in your MR job, so it's possible
you write many versions?
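If speculative execution turns out to be the cause, it can be disabled per job. A minimal sketch, assuming MRv1-style property names of that era and a job driver that implements Tool; the jar and class names are placeholders:

```shell
# Disable speculative map and reduce tasks for this job only; duplicate
# speculative attempts would otherwise each issue their own Puts to the table.
hadoop jar parse-job.jar com.example.ParseJob \
  -Dmapred.map.tasks.speculative.execution=false \
  -Dmapred.reduce.tasks.speculative.execution=false
```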

Does your MR job fail many tasks? Even though a task ultimately fails, it
will have written some subset of its output before failing, hence bloating
your versions.
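One way to check for piled-up cell versions is a raw scan from the hbase shell, which returns every stored cell rather than just the newest version (table name from the thread; the VERSIONS and LIMIT values are only illustrative):

```shell
# RAW => true returns all stored cell versions (and delete markers),
# so duplicate writes from retried or speculative tasks become visible
echo "scan 'Table2.1', {RAW => true, VERSIONS => 10, LIMIT => 5}" | hbase shell
```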

You are putting everything into protobufs?  Could that be bloating your
data?  Can you take a smaller subset and dump a string version of the pb
to the log?  Use TextFormat.
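If the values are written with protoc-generated classes, a dumped cell value can also be decoded outside the job with protoc's text-format decoder (the message type, .proto file, and value file here are placeholders):

```shell
# Decode a binary protobuf blob read from stdin into human-readable text format
protoc --decode=example.Record record.proto < cell-value.bin
```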

It can be informative to look at hfile content.  It could give you a clue
as to the bloat.  See http://hbase.apache.org/book.html#hfile_tool
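For example, to print the key/values and metadata of a single store file (the region and file names are placeholders; the column family directory is 'data' per the schema above):

```shell
# -p prints key/values, -m prints the hfile meta block, -f names the file
hbase org.apache.hadoop.hbase.io.hfile.HFile \
  -p -m -f hdfs:///hbase/Table2.1/<encoded-region-name>/data/<hfile-name>
```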

