hive-user mailing list archives

From Ajo Fod <ajo....@gmail.com>
Subject Re: On compressed storage : why are sequence files bigger than text files?
Date Tue, 18 Jan 2011 15:25:51 GMT
I tried the gzip compression codec. BTW, what do you think of bz2? I've
read that bz2 files can be split as input to different mappers ... is
there a catch?
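(For what it's worth, trying bz2 should just be a matter of swapping
the codec; BZip2Codec ships with stock Hadoop, though whether a given
Hadoop version will actually split bzip2 files on input is exactly the
open question:)

set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;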

Here are my flags now ... of these, the last two were added per your suggestion.
SET hive.enforce.bucketing=TRUE;
set hive.exec.compress.output=true;
set hive.merge.mapfiles = false;
set io.seqfile.compression.type = BLOCK;
set io.seqfile.compress.blocksize=1000000;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
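
(As an aside, one way to double-check the result is to inspect the
files directly; "hadoop fs -text" decodes sequence files and deflate
output back to plain text, so it is easy to confirm the codec actually
took effect. The file name below is a placeholder:)

hadoop fs -ls /user/hive/warehouse/alltrades/dt=2001-05-22/
hadoop fs -text /user/hive/warehouse/alltrades/dt=2001-05-22/<file> | head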


The results:
Text files with compression total about 18MB (2 files), as before ...
and the query takes 32 seconds to complete.
Sequence files are now stored in 2 files totaling 244MB ... and the
query takes about 84 seconds.
... mind you, the original was a single file of 132MB.

Cheers,
-Ajo


On Tue, Jan 18, 2011 at 6:36 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> On Tue, Jan 18, 2011 at 9:07 AM, Ajo Fod <ajo.fod@gmail.com> wrote:
>> Hello,
>>
>> My questions in short are:
>> - why are sequencefiles bigger than textfiles (considering that they
>> are binary)?
>> - It looks like compression does not make for a smaller sequence file
>> than the original text file.
>>
>> -- here is sample data that is transferred into the tables below with an INSERT OVERWRITE:
>> A       09:33:30        N       38.75   109100  0       522486  40
>> A       09:33:31        M       38.75   200     0       0       0
>> A       09:33:31        M       38.75   100     0       0       0
>> A       09:33:31        M       38.75   100     0       0       0
>> A       09:33:31        M       38.75   100     0       0       0
>> A       09:33:31        M       38.75   100     0       0       0
>> A       09:33:31        M       38.75   500     0       0       0
>>
>> -- so focusing on columns 4 and 5:
>> -- text representation: columns 4 and 5 are 5 and 3 bytes long respectively (8 bytes total).
>> -- binary representation: columns 4 and 5 are 4 and 4 bytes long respectively (8 bytes total).
>> -- NOTE: I drop the last 3 columns in the table representation.
>>
>> -- The original size of one sample partition was 132MB ... extract from <ls>:
>> 132M 2011-01-16 18:20 data/2001-05-22
>>
>> -- ... so  I set the following hive variables:
>>
>> set hive.exec.compress.output=true;
>> set hive.merge.mapfiles = false;
>> set io.seqfile.compression.type = BLOCK;
>>
>> -- ... and create the following table.
>> CREATE TABLE alltrades
>>       (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>> PARTITIONED BY (dt STRING)
>> CLUSTERED BY (symbol)
>> SORTED BY (time ASC)
>> INTO 4 BUCKETS
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>> STORED AS TEXTFILE ;
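>>
>> -- ... the INSERT OVERWRITE mentioned above looks roughly like this
>> -- (the staging table name raw_trades is a placeholder):
>> INSERT OVERWRITE TABLE alltrades PARTITION (dt='2001-05-22')
>> SELECT symbol, time, exchange, price, volume
>> FROM raw_trades;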
>>
>> -- ... now the table is split into 2 files. (!! shouldn't this be 4?
>> ... but that is discussed in a previous mail to this group)
>> -- The bucket files total 17.5MB.
>> 9,009,080 2011-01-18 05:32
>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
>> 8,534,264 2011-01-18 05:32
>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate
>>
>> -- ... so, I wondered, what would happen if I used SEQUENCEFILE instead
>> CREATE TABLE alltrades
>>       (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>> PARTITIONED BY (dt STRING)
>> CLUSTERED BY (symbol)
>> SORTED BY (time ASC)
>> INTO 4 BUCKETS
>> STORED AS SEQUENCEFILE;
>>
>> ... this created files totaling 193MB (larger even than the
>> original)!!
>> 99,751,137 2011-01-18 05:24
>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
>> 93,859,644 2011-01-18 05:24
>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
>>
>> So, in summary:
>> Why are sequence files bigger than the original?
>>
>>
>> -Ajo
>>
>
> It looks like you have not explicitly set the compression codec or the
> block size. This likely means you will end up with the default codec
> and a block size that probably adds more overhead than the compression
> saves. Don't you just love this stuff?
>
> Experiment with these settings:
> io.seqfile.compress.blocksize=1000000
> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
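>
> (Side note: in the Hive CLI, issuing a property name alone echoes its
> current value, which makes it easy to confirm what is actually in
> effect, e.g.:)
>
> hive> set mapred.output.compression.codec;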
>
