From: Tom Hall
Date: Fri, 6 May 2011 17:39:56 +0100
Subject: Sequence File Compression
To: user@hive.apache.org

I have read http://wiki.apache.org/hadoop/Hive/CompressedStorage and am
trying to compress some of our tables. The only difference from the example
is that our tables are partitioned.

SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;

insert overwrite table keywords_lzo partition (dated = '2010-01-25', client = 'TESTCLIENT')
select account, campaign, ad_group, keyword_id, keyword, match_type, status,
  first_page_bid, quality_score, distribution, max_cpc, destination_url,
  ad_group_status, campaign_status, currency_code, impressions, clicks, ctr,
  cpc, cost, avg_position, account_id, campaign_id, adgroup_id
from keywords
where dated = '2010-01-25' and client = 'TESTCLIENT';

The keywords_lzo table was created by running:

CREATE TABLE `keywords_lzo` (
  `account` STRING,
  `campaign` STRING,
  `ad_group` STRING,
  `keyword_id` STRING,
  `keyword` STRING,
  `match_type` STRING,
  `status` STRING,
  `first_page_bid` STRING,
  `quality_score` FLOAT,
  `distribution` STRING,
  `max_cpc` FLOAT,
  `destination_url` STRING,
  `ad_group_status` STRING,
  `campaign_status` STRING,
  `currency_code` STRING,
  `impressions` INT,
  `clicks` INT,
  `ctr` FLOAT,
  `cpc` STRING,
  `cost` FLOAT,
  `avg_position` FLOAT,
  `account_id` STRING,
  `campaign_id` STRING,
  `adgroup_id` STRING
)
PARTITIONED BY (`dated` STRING, `client` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;

The problem is that the output is 12 files totalling the same size as the
input CSV (~750MB). I get the same behaviour with either LZO or GZIP. If I
use TEXTFILE then I get the compression I would expect.

In the head of an output file I can see

SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text �'org.apache.hadoop.io.compress.GzipCodec

and the data after it does look like garbage, so something is happening, but
the total size is not reduced from the CSV.

Are the 12 output files significant? They are ~60MB each and the block size is 64MB.

I also tried SET io.seqfile.compression.type=RECORD; but still got 12 files
and no reduction in size.

Thanks,
Tom
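P.S. In case it helps anyone reproduce the effect locally: a rough sketch of why per-record compression can leave short rows nearly as big as the input. This uses plain zlib rather than Hive/Hadoop itself, and the sample rows are made up, but it shows the per-record vs. whole-block ratio difference:

```python
import zlib

# Fabricated short tab-delimited records, roughly like the keywords rows
records = [("row%d\taccount\tcampaign\t0.5\t100\n" % i).encode() for i in range(1000)]

# RECORD-style: each record is compressed on its own, paying the
# per-stream overhead every time and finding no cross-row redundancy
record_style = sum(len(zlib.compress(r)) for r in records)

# BLOCK-style: many records compressed together in one stream
block_style = len(zlib.compress(b"".join(records)))

print(record_style, block_style)  # per-record total is far larger
```

On data like this the per-record total comes out several times larger than the single-stream size, which would be consistent with RECORD-mode output staying close to the raw CSV size.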