hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Explosion in datasize using HBase as a MR sink
Date Wed, 29 May 2013 16:20:11 GMT
Did you presplit Table2.1?

In the master log, do you see region splits happening during the MR job run?
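
For reference, presplitting means creating the table with a set of split points up front, so that the MR job writes into many regions from the start instead of forcing splits while it runs. A minimal sketch of generating evenly spaced split points follows. It assumes the row keys are fixed-width hex strings spread uniformly over the key space, which the thread does not state; `hex_split_keys` is a hypothetical helper, not part of HBase. The resulting keys would then be passed as split keys at table creation time (for example via `HBaseAdmin.createTable(desc, splitKeys)` in the Java client).

```python
def hex_split_keys(num_regions, width=8):
    """Return num_regions - 1 evenly spaced, zero-padded hex split points.

    Assumption: row keys are uniformly distributed hex strings of `width`
    characters. With a different key layout these split points would not
    balance the regions.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    key_space = 16 ** width
    return [
        format(i * key_space // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]


# Example: split points for a table presplit into 4 regions.
keys = hex_split_keys(4)
print(keys)
```

The split points are plain strings here; the client would have to encode them as bytes in the same way the row keys are encoded.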


On Wed, May 29, 2013 at 8:28 AM, Rob <robby.verkuylen@gmail.com> wrote:

> We're moving from ingesting our data via the Thrift API to inserting our
> records via a MapReduce job. For the MR job I've used the exact same job
> setup from HBase DefG, page 309. We're running CDH4.0.1, HBase 0.92.1.
> We are parsing data from an HBase Table1 into an HBase Table2. Table1 is
> unparsed data; Table2 is parsed and stored as a protobuf. This works fine
> when doing it via the Thrift API (in Python), but that doesn't scale, so we
> want to move to using an MR job. Both T1 and T2 contain 100M records. Current
> stats, using 2 GB region sizes:
> Table1: 130 regions, taking up 134 GB of space
> Table2: 28 regions, taking up 39.3 GB of space
> The problem arises when I take a sample of 6M records from Table1 and M/R
> those into a new Table2.1. Those 6M records suddenly get spread over 178
> regions taking up 217.5 GB of disk space.
> Both T2 and T2.1 have the following simple schema:
>         create 'Table2', {NAME => 'data', COMPRESSION => 'SNAPPY',
> VERSIONS => 1}
> I can retrieve and parse records from both T2 and T2.1, so the data is
> there and validated, but I can't figure out why the explosion in
> size is happening. Triggering a major compaction makes little difference
> (about 2 GB in total size). I understand that Snappy compression is applied
> when the region servers create store files and HFiles, so the data should
> already be compressed on disk.
> Any thoughts?
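
To put the numbers quoted above in perspective, here is a small back-of-the-envelope calculation of disk usage per record for each table. It uses only the figures from the message and assumes 1 GB = 10^9 bytes (the message does not say whether GB or GiB is meant); the variable names are illustrative.

```python
# Figures quoted in the message; GB is treated as 10**9 bytes (assumption).
GB = 10**9

tables = {
    "Table1":   {"records": 100_000_000, "size_gb": 134.0, "regions": 130},
    "Table2":   {"records": 100_000_000, "size_gb": 39.3,  "regions": 28},
    "Table2.1": {"records": 6_000_000,   "size_gb": 217.5, "regions": 178},
}


def bytes_per_record(table):
    """Average on-disk bytes per record for one table."""
    return table["size_gb"] * GB / table["records"]


per_record = {name: bytes_per_record(t) for name, t in tables.items()}

for name, t in tables.items():
    gb_per_region = t["size_gb"] / t["regions"]
    print(f"{name}: {per_record[name]:,.0f} bytes/record, "
          f"{gb_per_region:.2f} GB/region")

# How much more disk each record costs in Table2.1 than in Table2.
blowup = per_record["Table2.1"] / per_record["Table2"]
print(f"Table2.1 uses about {blowup:.0f}x more disk per record than Table2")
```

On these figures a record in Table2.1 takes roughly 36 KB on disk against roughly 0.4 KB in Table2, even though both tables hold the same parsed format under the same schema.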
