Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
MIME-Version: 1.0
In-Reply-To: <BANLkTim_VG92dnG+fxC89NTSKAJBVvKgMw@mail.gmail.com>
References: <BANLkTim_VG92dnG+fxC89NTSKAJBVvKgMw@mail.gmail.com>
From: Tom Hall <thattommyhall@gmail.com>
Date: Mon, 9 May 2011 18:19:55 +0100
Message-ID: <BANLkTinaDtnAM+kG=RC4rR-YUHYGBMBVUQ@mail.gmail.com>
Subject: Re: Sequence File Compression
To: user@hive.apache.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Does anyone have an idea on this?
Is anyone using compression with sequence files successfully?

The wiki and Hadoop: The Definitive Guide both suggest that the settings
below are correct, so I am at a loss to explain what we are seeing.


Tom


On Fri, May 6, 2011 at 5:39 PM, Tom Hall <thattommyhall@gmail.com> wrote:
> I have read http://wiki.apache.org/hadoop/Hive/CompressedStorage and
> am trying to compress some of our tables. The only difference from the
> example there is that our tables are partitioned.
>
>
> SET io.seqfile.compression.type=BLOCK;
> SET hive.exec.compress.output=true;
> SET mapred.output.compress=true;
> insert overwrite table keywords_lzo partition (dated = '2010-01-25',
> client = 'TESTCLIENT') select
> account,campaign,ad_group,keyword_id,keyword,match_type,status,
> first_page_bid,quality_score,distribution,max_cpc,destination_url,
> ad_group_status,campaign_status,currency_code,impressions,clicks,ctr,
> cpc,cost,avg_position,account_id,campaign_id,adgroup_id
> from keywords where dated = '2010-01-25' and client = 'TESTCLIENT';
>
>
>
> The keywords_lzo table was created by running:
>
> CREATE TABLE `keywords_lzo` (
>  `account` STRING,
>  `campaign` STRING,
>  `ad_group` STRING,
>  `keyword_id` STRING,
>  `keyword` STRING,
>  `match_type` STRING,
>  `status` STRING,
>  `first_page_bid` STRING,
>  `quality_score` FLOAT,
>  `distribution` STRING,
>  `max_cpc` FLOAT,
>  `destination_url` STRING,
>  `ad_group_status` STRING,
>  `campaign_status` STRING,
>  `currency_code` STRING,
>  `impressions` INT,
>  `clicks` INT,
>  `ctr` FLOAT,
>  `cpc` STRING,
>  `cost` FLOAT,
>  `avg_position` FLOAT,
>  `account_id` STRING,
>  `campaign_id` STRING,
>  `adgroup_id` STRING
> )
> PARTITIONED BY (
> `dated` STRING,
> `client` STRING
> )
>
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
>
> STORED AS SEQUENCEFILE;
>
>
> The problem is that the output is 12 files totaling the same size as
> the input CSV (~750MB). With either LZO or GZIP I get the same
> behaviour. If I use TEXTFILE then I get the compression I would
> expect.
>
> I can see in the head of the file
> SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text =EF=BF=
=BD'org.apache.hadoop.io.compress.GzipCodec
> and it does look like garbage so something is happening but the total
> size is not reduced from the CSV
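
Rather than eyeballing the header, the compression flags can be read out directly. Below is a minimal Python sketch, assuming the standard SequenceFile header layout for version 5 or later (magic "SEQ", a version byte, the key and value class names, a compression flag, a block-compression flag, then the codec class name) and assuming one of the output files has been copied to local disk first. The function names and the command-line path are illustrative only, not part of any Hadoop tool.

```python
import struct
import sys


def read_hadoop_string(stream):
    """Read a length-prefixed UTF-8 string as written in the header.

    Assumption: the length fits in a single-byte vint (0..127), which
    holds for ordinary class names. Anything else is rejected.
    """
    (length,) = struct.unpack("b", stream.read(1))
    if length < 0:
        raise ValueError("negative or multi-byte length not handled in this sketch")
    return stream.read(length).decode("utf-8")


def parse_seqfile_header(stream):
    """Return the compression-related fields of a SequenceFile header."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile (missing SEQ magic)")
    version = stream.read(1)[0]
    if version < 5:
        raise ValueError("this sketch only handles header version 5 or later")
    header = {
        "version": version,
        "key_class": read_hadoop_string(stream),
        "value_class": read_hadoop_string(stream),
    }
    # Two one-byte booleans follow the class names; any nonzero byte is true.
    header["compressed"] = stream.read(1) != b"\x00"
    header["block_compressed"] = stream.read(1) != b"\x00"
    # The codec class name is only present when compression is enabled.
    header["codec"] = read_hadoop_string(stream) if header["compressed"] else None
    return header


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python seqhead.py <local copy of one output file>
    with open(sys.argv[1], "rb") as f:
        for name, value in parse_seqfile_header(f).items():
            print(name, "=", value)
```

If `compressed` is true but `block_compressed` is false, the file was written with RECORD rather than BLOCK compression, regardless of what was SET in the session.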
>
>
> Are the 12 output files significant? They are ~60M each and the block
> size is 64M.
> I also tried SET io.seqfile.compression.type=RECORD; but still got 12
> files and no reduction in size.
>
>
> Thanks,
> Tom
>