From: Steven Wong
To: user@hive.apache.org
Date: Wed, 19 Jan 2011 11:30:39 -0800
Subject: RE: On compressed storage : why are sequence files bigger than text files?

Here's a simple check -- look inside one of your sequence files:

hadoop fs -cat /your/seq/file | head

If it is compressed, the header will contain the compression codec's name and the data will look like gibberish. Otherwise, it is not compressed.
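Steven's header check can also be done in code. The sketch below is not part of the original exchange; it is a minimal example built on Hadoop's SequenceFile.Reader API (0.20-era constructor), with an invented class name and a placeholder path, that prints whether a file is compressed, whether it is block-compressed, and which codec it carries.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SeqFileCompressionCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder path; pass a real sequence file as the first argument.
    Path path = new Path(args.length > 0 ? args[0] : "/your/seq/file");
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      System.out.println("compressed:       " + reader.isCompressed());
      System.out.println("block compressed: " + reader.isBlockCompressed());
      if (reader.isCompressed()) {
        System.out.println("codec:            "
            + reader.getCompressionCodec().getClass().getName());
      }
    } finally {
      reader.close();
    }
  }
}

Compiled against the cluster's Hadoop jars, it can be run with something like "hadoop jar seqcheck.jar SeqFileCompressionCheck /path/to/file".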
-----Original Message-----
From: Ajo Fod [mailto:ajo.fod@gmail.com]
Sent: Tuesday, January 18, 2011 8:46 AM
To: user@hive.apache.org
Subject: Re: On compressed storage : why are sequence files bigger than text files?

I tried 10M for blocksize ... the files are not any smaller.

Also, I tried the BZ2 compression codec ... it takes forever ... the mapper ran for 10 mins and completed only 4% of the job on one partition. For comparison, with gzip, it took about 85 secs. So, I terminated the job prematurely.

In summary, what I started with ... gzip with text files seems to compress to about 10% of the original size.

I also tried out: set io.seqfile.compression.type = RECORD;

I have a feeling that compression is not turned on for sequence files for some reason.

-Ajo.

On Tue, Jan 18, 2011 at 7:28 AM, Edward Capriolo wrote:
> On Tue, Jan 18, 2011 at 10:25 AM, Ajo Fod wrote:
>> I tried with the gzip compression codec. BTW, what do you think of
>> bz2? I've read that it is possible to split as input to different
>> mappers ... is there a catch?
>>
>> Here are my flags now ... of these, the last 2 were added per your suggestion.
>> SET hive.enforce.bucketing=TRUE;
>> set hive.exec.compress.output=true;
>> set hive.merge.mapfiles = false;
>> set io.seqfile.compression.type = BLOCK;
>> set io.seqfile.compress.blocksize=1000000;
>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>
>> The results:
>> Text files result in about 18MB total (2 files) with compression, as
>> earlier ... BTW, it takes 32 sec to complete.
>> Sequence files are now stored in 2 files totaling 244MB ... it takes
>> about 84 seconds.
>> ... mind you, the original was one file of 132MB.
>>
>> Cheers,
>> -Ajo
>>
>> On Tue, Jan 18, 2011 at 6:36 AM, Edward Capriolo wrote:
>>> On Tue, Jan 18, 2011 at 9:07 AM, Ajo Fod wrote:
>>>> Hello,
>>>>
>>>> My questions in short are:
>>>> - Why are sequence files bigger than text files (considering that they
>>>> are binary)?
>>>> - It looks like compression does not make for a smaller sequence file
>>>> than the original text file.
>>>>
>>>> -- here is sample data that is transferred into the tables below with
>>>> an INSERT OVERWRITE:
>>>> A    09:33:30    N    38.75    109100    0    522486    40
>>>> A    09:33:31    M    38.75    200       0    0         0
>>>> A    09:33:31    M    38.75    100       0    0         0
>>>> A    09:33:31    M    38.75    100       0    0         0
>>>> A    09:33:31    M    38.75    100       0    0         0
>>>> A    09:33:31    M    38.75    100       0    0         0
>>>> A    09:33:31    M    38.75    500       0    0         0
>>>>
>>>> -- so focusing on columns 4 and 5:
>>>> -- text representation: columns 4 and 5 are 5 + 3 = 8 bytes long.
>>>> -- binary representation: columns 4 and 5 are 4 + 4 = 8 bytes long.
>>>> -- NOTE: I drop the last 3 columns in the table representation.
>>>>
>>>> -- The original size of one sample partition was 132MB ... extract from:
>>>> 132M 2011-01-16 18:20 data/2001-05-22
>>>>
>>>> -- ... so I set the following Hive variables:
>>>> set hive.exec.compress.output=true;
>>>> set hive.merge.mapfiles = false;
>>>> set io.seqfile.compression.type = BLOCK;
>>>>
>>>> -- ... and create the following table.
>>>> CREATE TABLE alltrades
>>>>   (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>>> PARTITIONED BY (dt STRING)
>>>> CLUSTERED BY (symbol)
>>>> SORTED BY (time ASC)
>>>> INTO 4 BUCKETS
>>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>>> STORED AS TEXTFILE;
>>>>
>>>> -- ... now the table is split into 2 files. (!! shouldn't this be 4?
>>>> ... but that is discussed in the previous mail to this group)
>>>> -- The bucket files total 17.5MB:
>>>> 9,009,080 2011-01-18 05:32
>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
>>>> 8,534,264 2011-01-18 05:32
>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate
>>>>
>>>> -- ... so, I wondered, what would happen if I used SEQUENCEFILE instead:
>>>> CREATE TABLE alltrades
>>>>   (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>>> PARTITIONED BY (dt STRING)
>>>> CLUSTERED BY (symbol)
>>>> SORTED BY (time ASC)
>>>> INTO 4 BUCKETS
>>>> STORED AS SEQUENCEFILE;
>>>>
>>>> ... this created files totaling 193MB (even larger than the original)!!
>>>> 99,751,137 2011-01-18 05:24
>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
>>>> 93,859,644 2011-01-18 05:24
>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
>>>>
>>>> So, in summary:
>>>> Why are sequence files bigger than the original?
>>>>
>>>> -Ajo
>>>>
>>> It looks like you have not explicitly set the compression codec or the
>>> block size. This likely means you will end up with the DefaultCodec and
>>> a block size that probably adds more overhead than compression. Don't
>>> you just love this stuff?
>>>
>>> Experiment with these settings:
>>> io.seqfile.compress.blocksize=1000000
>>> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>>
>
> I may have been unclear. Try different io.seqfile.compress.blocksize
> values (1,000,000 is not really that big).
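To make the block-size advice concrete, here is a small experiment sketch. It is not from the thread: the class name, row generator, record count, and /tmp paths are invented for illustration. It writes rows shaped like the sample data to local sequence files with no compression, record compression, and block compression at two io.seqfile.compress.blocksize settings, then prints the resulting file lengths. DefaultCodec stands in for the GzipCodec mentioned above because SequenceFile block compression with GzipCodec requires the native Hadoop libraries. With Text values and no block compression, the per-record length fields and sync markers add overhead, which is one plausible reason a poorly configured sequence file can come out larger than the original text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileSizeExperiment {

  // Synthetic rows roughly shaped like the trade data in the question.
  private static String row(int i) {
    return String.format("A\t09:33:%02d\tM\t38.75\t%d\t0\t0\t0", i % 60, 100 + (i % 500));
  }

  // Write the same rows with the given compression type and block size,
  // and return the resulting file length in bytes.
  private static long write(Configuration conf, Path path,
                            CompressionType type, int blockSize) throws Exception {
    conf.setInt("io.seqfile.compress.blocksize", blockSize);
    FileSystem fs = FileSystem.getLocal(conf);
    CompressionCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, NullWritable.class, Text.class, type, codec);
    try {
      Text value = new Text();
      for (int i = 0; i < 500000; i++) {
        value.set(row(i));
        writer.append(NullWritable.get(), value);
      }
    } finally {
      writer.close();
    }
    return fs.getFileStatus(path).getLen();
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.out.println("NONE              : "
        + write(conf, new Path("/tmp/trades-none.seq"), CompressionType.NONE, 1000000));
    System.out.println("RECORD            : "
        + write(conf, new Path("/tmp/trades-record.seq"), CompressionType.RECORD, 1000000));
    System.out.println("BLOCK, 1MB blocks : "
        + write(conf, new Path("/tmp/trades-block-1m.seq"), CompressionType.BLOCK, 1000000));
    System.out.println("BLOCK, 8MB blocks : "
        + write(conf, new Path("/tmp/trades-block-8m.seq"), CompressionType.BLOCK, 8000000));
  }
}

Comparing the NONE, RECORD, and BLOCK numbers from a run like this is also a quick way to confirm whether the compression settings are actually reaching the writer.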