I have read http://wiki.apache.org/hadoop/Hive/CompressedStorage and
am trying to compress some of our tables. The only difference from the
example there is that our tables are partitioned. These are the
statements I ran:
SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
insert overwrite table keywords_lzo partition (dated = '2010-01-25',
client = 'TESTCLIENT') select
account,campaign,ad_group,keyword_id,keyword,match_type,status,first_page_bid,quality_score,distribution,max_cpc,destination_url,ad_group_status,campaign_status,currency_code,impressions,clicks,ctr,cpc,cost,avg_position,account_id,campaign_id,adgroup_id
from keywords where dated = '2010-01-25' and client = 'TESTCLIENT';
The keywords_lzo table was created by running:
CREATE TABLE `keywords_lzo` (
`account` STRING,
`campaign` STRING,
`ad_group` STRING,
`keyword_id` STRING,
`keyword` STRING,
`match_type` STRING,
`status` STRING,
`first_page_bid` STRING,
`quality_score` FLOAT,
`distribution` STRING,
`max_cpc` FLOAT,
`destination_url` STRING,
`ad_group_status` STRING,
`campaign_status` STRING,
`currency_code` STRING,
`impressions` INT,
`clicks` INT,
`ctr` FLOAT,
`cpc` STRING,
`cost` FLOAT,
`avg_position` FLOAT,
`account_id` STRING,
`campaign_id` STRING,
`adgroup_id` STRING
)
PARTITIONED BY (
`dated` STRING,
`client` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;
The problem is that the output is 12 files totaling the same size as
the input CSV (~750MB). With either LZO or GZIP I get the same
behaviour. If I use TEXTFILE then I get the compression I would
expect.
At the head of one of the output files I can see:
SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text
�'org.apache.hadoop.io.compress.GzipCodec
and the rest does look like garbage, so something is happening, but the
total size is not reduced from the CSV.
Are the 12 output files significant? They are ~60MB each and the block
size is 64MB.
I also tried SET io.seqfile.compression.type=RECORD; but still got 12
files and no reduction in size.
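One variation that may be worth ruling out (a sketch, not a confirmed fix): Hadoop's SequenceFileOutputFormat reads its compression type from the job property mapred.output.compression.type, which defaults to RECORD, and the codec from mapred.output.compression.codec. Whether Hive honours these for a partitioned insert in this version is an assumption, and the warehouse path in the dfs line is hypothetical and needs adjusting to the actual table location.

```sql
-- Enable compressed output, as in the original attempt.
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;

-- Assumption: these job-level properties, rather than
-- io.seqfile.compression.type, decide how the SequenceFile is written.
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Re-run the same INSERT OVERWRITE ... PARTITION statement here.

-- Then compare the partition size with the ~750MB input.
-- Hypothetical path; the default warehouse directory is assumed.
dfs -du /user/hive/warehouse/keywords_lzo/dated=2010-01-25/client=TESTCLIENT;
```

If the size reported by dfs -du drops after this change, the compression type setting was the culprit; if not, the cause lies elsewhere.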
Thanks,
Tom