From: Steven Wong
To: user@hive.apache.org
Date: Wed, 19 Jan 2011 11:30:39 -0800
Subject: RE: On compressed storage : why are sequence files bigger than text files?

Here's a simple check -- look inside one of your sequence files:

hadoop fs -cat /your/seq/file | head

If it is compressed, the header will contain the compression codec's name and the data will look like gibberish. Otherwise, it is not compressed.
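Steven's header check can also be done in code. The sketch below is not part of the original exchange; it is a minimal example built on Hadoop's SequenceFile.Reader API (0.20-era constructor), with an invented class name and a placeholder path, that prints whether a file is compressed, whether it is block-compressed, and which codec it carries.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SeqFileCompressionCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder path; pass a real sequence file as the first argument.
    Path path = new Path(args.length > 0 ? args[0] : "/your/seq/file");
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      System.out.println("compressed:       " + reader.isCompressed());
      System.out.println("block compressed: " + reader.isBlockCompressed());
      if (reader.isCompressed()) {
        System.out.println("codec:            "
            + reader.getCompressionCodec().getClass().getName());
      }
    } finally {
      reader.close();
    }
  }
}

Compiled against the cluster's Hadoop jars, it can be run with something like "hadoop jar seqcheck.jar SeqFileCompressionCheck /path/to/file".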
-----Original Message-----
From: Ajo Fod [mailto:ajo.fod@gmail.com]
Sent: Tuesday, January 18, 2011 8:46 AM
To: user@hive.apache.org
Subject: Re: On compressed storage : why are sequence files bigger than text files?

I tried 10M for blocksize ... the files are not any smaller.

Also, I tried the BZ2 compression codec ... it takes forever ... the mapper ran for 10 mins and completed only 4% of the job on one partition. For comparison, with gzip, it took about 85 secs. So, I terminated the job prematurely.

In summary, what I started with ... gzip with text files seems to compress to about 10% of the original size.

I also tried out: set io.seqfile.compression.type = RECORD;

I have a feeling that compression is not turned on for sequence files for some reason.

-Ajo.

On Tue, Jan 18, 2011 at 7:28 AM, Edward Capriolo wrote:
> On Tue, Jan 18, 2011 at 10:25 AM, Ajo Fod wrote:
>> I tried with the gzip compression codec. BTW, what do you think of
>> bz2? I've read that it is possible to split as input to different
>> mappers ... is there a catch?
>>
>> Here are my flags now ... of these, the last 2 were added per your suggestion.
>> SET hive.enforce.bucketing=TRUE;
>> set hive.exec.compress.output=true;
>> set hive.merge.mapfiles = false;
>> set io.seqfile.compression.type = BLOCK;
>> set io.seqfile.compress.blocksize=1000000;
>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>
>> The results:
>> Text files result in about 18MB total (2 files) with compression, as
>> earlier ... BTW, it takes 32 sec to complete.
>> Sequence files are now stored in 2 files totaling 244MB ... it takes
>> about 84 seconds.
>> ... mind you, the original was one file of 132MB.
>>
>> Cheers,
>> -Ajo
>>
>> On Tue, Jan 18, 2011 at 6:36 AM, Edward Capriolo wrote:
>>> On Tue, Jan 18, 2011 at 9:07 AM, Ajo Fod wrote:
>>>> Hello,
>>>>
>>>> My questions in short are:
>>>> - Why are sequence files bigger than text files (considering that they
>>>> are binary)?
>>>> - It looks like compression does not make for a smaller sequence file
>>>> than the original text file.
>>>>
>>>> -- here is sample data that is transferred into the tables below with
>>>> an INSERT OVERWRITE:
>>>> A    09:33:30    N    38.75    109100    0    522486    40
>>>> A    09:33:31    M    38.75    200       0    0         0
>>>> A    09:33:31    M    38.75    100       0    0         0
>>>> A    09:33:31    M    38.75    100       0    0         0
>>>> A    09:33:31    M    38.75    100       0    0         0
>>>> A    09:33:31    M    38.75    100       0    0         0
>>>> A    09:33:31    M    38.75    500       0    0         0
>>>>
>>>> -- so focusing on columns 4 and 5:
>>>> -- text representation: columns 4 and 5 are 5 + 3 = 8 bytes long.
>>>> -- binary representation: columns 4 and 5 are 4 + 4 = 8 bytes long.
>>>> -- NOTE: I drop the last 3 columns in the table representation.
>>>>
>>>> -- The original size of one sample partition was 132MB ... extract from:
>>>> 132M 2011-01-16 18:20 data/2001-05-22
>>>>
>>>> -- ... so I set the following Hive variables:
>>>> set hive.exec.compress.output=true;
>>>> set hive.merge.mapfiles = false;
>>>> set io.seqfile.compression.type = BLOCK;
>>>>
>>>> -- ... and create the following table.
>>>> CREATE TABLE alltrades
>>>>   (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>>> PARTITIONED BY (dt STRING)
>>>> CLUSTERED BY (symbol)
>>>> SORTED BY (time ASC)
>>>> INTO 4 BUCKETS
>>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>>> STORED AS TEXTFILE;
>>>>
>>>> -- ... now the table is split into 2 files. (!! shouldn't this be 4?
>>>> ... but that is discussed in the previous mail to this group)
>>>> -- The bucket files total 17.5MB:
>>>> 9,009,080 2011-01-18 05:32
>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
>>>> 8,534,264 2011-01-18 05:32
>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate
>>>>
>>>> -- ... so, I wondered, what would happen if I used SEQUENCEFILE instead:
>>>> CREATE TABLE alltrades
>>>>   (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>>> PARTITIONED BY (dt STRING)
>>>> CLUSTERED BY (symbol)
>>>> SORTED BY (time ASC)
>>>> INTO 4 BUCKETS
>>>> STORED AS SEQUENCEFILE;
>>>>
>>>> ... this created files totaling 193MB (even larger than the original)!!
>>>> 99,751,137 2011-01-18 05:24
>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
>>>> 93,859,644 2011-01-18 05:24
>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
>>>>
>>>> So, in summary:
>>>> Why are sequence files bigger than the original?
>>>>
>>>> -Ajo
>>>>
>>> It looks like you have not explicitly set the compression codec or the
>>> block size. This likely means you will end up with the DefaultCodec and
>>> a block size that probably adds more overhead than compression. Don't
>>> you just love this stuff?
>>>
>>> Experiment with these settings:
>>> io.seqfile.compress.blocksize=1000000
>>> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>>
>
> I may have been unclear. Try different io.seqfile.compress.blocksize
> values (1,000,000 is not really that big).
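To make the block-size advice concrete, here is a small experiment sketch. It is not from the thread: the class name, row generator, record count, and /tmp paths are invented for illustration. It writes rows shaped like the sample data to local sequence files with no compression, record compression, and block compression at two io.seqfile.compress.blocksize settings, then prints the resulting file lengths. DefaultCodec stands in for the GzipCodec mentioned above because SequenceFile block compression with GzipCodec requires the native Hadoop libraries. With Text values and no block compression, the per-record length fields and sync markers add overhead, which is one plausible reason a poorly configured sequence file can come out larger than the original text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileSizeExperiment {

  // Synthetic rows roughly shaped like the trade data in the question.
  private static String row(int i) {
    return String.format("A\t09:33:%02d\tM\t38.75\t%d\t0\t0\t0", i % 60, 100 + (i % 500));
  }

  // Write the same rows with the given compression type and block size,
  // and return the resulting file length in bytes.
  private static long write(Configuration conf, Path path,
                            CompressionType type, int blockSize) throws Exception {
    conf.setInt("io.seqfile.compress.blocksize", blockSize);
    FileSystem fs = FileSystem.getLocal(conf);
    CompressionCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, NullWritable.class, Text.class, type, codec);
    try {
      Text value = new Text();
      for (int i = 0; i < 500000; i++) {
        value.set(row(i));
        writer.append(NullWritable.get(), value);
      }
    } finally {
      writer.close();
    }
    return fs.getFileStatus(path).getLen();
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.out.println("NONE              : "
        + write(conf, new Path("/tmp/trades-none.seq"), CompressionType.NONE, 1000000));
    System.out.println("RECORD            : "
        + write(conf, new Path("/tmp/trades-record.seq"), CompressionType.RECORD, 1000000));
    System.out.println("BLOCK, 1MB blocks : "
        + write(conf, new Path("/tmp/trades-block-1m.seq"), CompressionType.BLOCK, 1000000));
    System.out.println("BLOCK, 8MB blocks : "
        + write(conf, new Path("/tmp/trades-block-8m.seq"), CompressionType.BLOCK, 8000000));
  }
}

Comparing the NONE, RECORD, and BLOCK numbers from a run like this is also a quick way to confirm whether the compression settings are actually reaching the writer.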