hbase-user mailing list archives

From Rob <robby.verkuy...@gmail.com>
Subject Re: Explosion in datasize using HBase as a MR sink
Date Wed, 29 May 2013 20:44:20 GMT
Yes, the records get written, confirmed with a row count. We have REST services pointing at
T2, and they work just fine when pointed at T2.1 instead. So the data in it is valid.

Since the table's VERSIONS is set to 1, I'm ruling out multiple versions: those should be
removed, at least after a major compaction. And I only run this job once anyway.
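For scale, here's a quick back-of-the-envelope check (plain Java; the table sizes are the ones quoted further down in this thread, and the per-cell overhead figure is my reading of the standard KeyValue layout, so treat it as a sketch, not a measurement):

```java
public class SizeCheck {
    public static void main(String[] args) {
        // Table2: 100M records in 39.3 GB (Snappy-compressed), from the thread below
        double t2BytesPerRecord = 39.3e9 / 100e6;
        // Table2.1: 6M records in 217.5 GB after the MR job
        double t21BytesPerRecord = 217.5e9 / 6e6;

        System.out.printf("T2:   ~%.0f bytes/record%n", t2BytesPerRecord);   // ~393
        System.out.printf("T2.1: ~%.0f bytes/record%n", t21BytesPerRecord);  // ~36250
        System.out.printf("blow-up: ~%.0fx%n", t21BytesPerRecord / t2BytesPerRecord); // ~92

        // Fixed per-cell KeyValue overhead, on top of the value bytes themselves:
        // 4 (key len) + 4 (value len) + 2 (row len) + 1 (family len)
        // + 8 (timestamp) + 1 (key type) = 20 bytes, plus row/family/qualifier bytes.
        int fixedOverheadPerCell = 4 + 4 + 2 + 1 + 8 + 1;
        System.out.println("fixed KeyValue overhead/cell: " + fixedOverheadPerCell + " bytes");
    }
}
```

Even generous per-cell overhead is tens of bytes, not tens of kilobytes, so a ~90x per-record difference has to come from somewhere else (e.g. compression not taking effect on T2.1, or extra data being written per record).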

See below for the relevant driver and mapper parts of the job:

Job job = new Job(conf, "T1 Parsed to T2.1");

/* Scan based on timerange */
Scan scan = new Scan();
setScannerAttributes(scan, conf); //sets time range scan values and caching

                        "T1", scan, MyMapper.class,
                        ImmutableBytesWritable.class, Put.class, job);
                        "T2.1", IdentityTableReducer.class, job);


public class MyMapper extends TableMapper<ImmutableBytesWritable, Writable> {

    public void map(ImmutableBytesWritable row, Result columns, Context context) throws IOException {
        try {
            Put put = new Put(row.get());
            // parse magic
            put.add(colFam, colName, protobufEvent.toByteArray());

            context.write(row, put);
        } catch (Ex…) {...}
    }
}

On May 29, 2013, at 21:32, Ted Yu <yuzhihong@gmail.com> wrote:

> bq. but does that account for the sizes?
> No. It should not.
> Can you tell us more about your MR job ?
> I assume that you have run RowCounter on Table2.1 to verify the number of
> rows matches 6M records.
> Cheers
> On Wed, May 29, 2013 at 12:27 PM, Rob <robby.verkuylen@gmail.com> wrote:
>> No I did not presplit and yes splits happen during the job run.
>> I know pre-splitting is a best practice, but does that account for the
>> sizes?
>> On May 29, 2013, at 18:20, Ted Yu <yuzhihong@gmail.com> wrote:
>>> Did you presplit Table2.1?
>>> From the master log, do you see region splitting happen during the MR
>>> job run?
>>> Thanks
>>> On Wed, May 29, 2013 at 8:28 AM, Rob <robby.verkuylen@gmail.com> wrote:
>>>> We're moving from ingesting our data via the Thrift API to inserting our
>>>> records via a MapReduce job. For the MR job I've used the exact same job
>>>> setup from HBase DefG, page 309. We're running CDH4.0.1, HBase 0.92.1.
>>>> We are parsing data from an HBase Table1 into an HBase Table2; Table1 is
>>>> unparsed data, Table2 is parsed and stored as a protobuf. This works
>>>> fine when doing it via the Thrift API (in Python), but that doesn't
>>>> scale, so we want to move to using an MR job. Both T1 and T2 contain
>>>> 100M records. Current stats, using 2GB region sizes:
>>>> Table1: 130 regions, taking up 134GB of space
>>>> Table2: 28 regions, taking up 39.3GB of space
>>>> The problem arises when I take a sample from Table1 of 6M records and
>>>> M/R those into a new Table2.1. Those 6M records suddenly get spread over
>>>> 178 regions taking up 217.5GB of disk space.
>>>> Both T2 and T2.1 have the following simple schema:
>>>>       create 'Table2', {NAME => 'data', COMPRESSION => 'SNAPPY',
>>>> VERSIONS => 1}
>>>> I can retrieve and parse records from both T2 and T2.1, so the data is
>>>> there and validated, but I can't seem to figure out why the explosion in
>>>> size is happening. Triggering a major compaction does not make much of a
>>>> difference (2GB in total size). I understand that Snappy compression gets
>>>> applied directly when RSs create store- and hfiles, so compression
>>>> should be applied directly.
>>>> Any thoughts?
