From: Tom Hall
Date: Fri, 6 May 2011 17:39:56 +0100
Subject: Sequence File Compression
To: user@hive.apache.org

I have read http://wiki.apache.org/hadoop/Hive/CompressedStorage and am
trying to compress some of our tables. The only difference from the example
is that our tables are partitioned.

SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;

insert overwrite table keywords_lzo partition (dated = '2010-01-25', client = 'TESTCLIENT')
select account, campaign, ad_group, keyword_id, keyword, match_type, status,
  first_page_bid, quality_score, distribution, max_cpc, destination_url,
  ad_group_status, campaign_status, currency_code, impressions, clicks, ctr,
  cpc, cost, avg_position, account_id, campaign_id, adgroup_id
from keywords
where dated = '2010-01-25' and client = 'TESTCLIENT';

The keywords_lzo table was created by running:

CREATE TABLE `keywords_lzo` (
  `account` STRING,
  `campaign` STRING,
  `ad_group` STRING,
  `keyword_id` STRING,
  `keyword` STRING,
  `match_type` STRING,
  `status` STRING,
  `first_page_bid` STRING,
  `quality_score` FLOAT,
  `distribution` STRING,
  `max_cpc` FLOAT,
  `destination_url` STRING,
  `ad_group_status` STRING,
  `campaign_status` STRING,
  `currency_code` STRING,
  `impressions` INT,
  `clicks` INT,
  `ctr` FLOAT,
  `cpc` STRING,
  `cost` FLOAT,
  `avg_position` FLOAT,
  `account_id` STRING,
  `campaign_id` STRING,
  `adgroup_id` STRING
)
PARTITIONED BY (`dated` STRING, `client` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;

The problem is that the output is 12 files totalling the same size as the
input CSV (~750MB). I get the same behaviour with either LZO or GZIP. If I
use TEXTFILE then I get the compression I would expect.

In the head of an output file I can see

SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text �'org.apache.hadoop.io.compress.GzipCodec

and the data after it does look like garbage, so something is happening, but
the total size is not reduced from the CSV.

Are the 12 output files significant? They are ~60MB each and the block size is 64MB.

I also tried SET io.seqfile.compression.type=RECORD; but still got 12 files
and no reduction in size.

Thanks,
Tom
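P.S. In case it helps anyone reproduce the effect locally: a rough sketch of why per-record compression can leave short rows nearly as big as the input. This uses plain zlib rather than Hive/Hadoop itself, and the sample rows are made up, but it shows the per-record vs. whole-block ratio difference:

```python
import zlib

# Fabricated short tab-delimited records, roughly like the keywords rows
records = [("row%d\taccount\tcampaign\t0.5\t100\n" % i).encode() for i in range(1000)]

# RECORD-style: each record is compressed on its own, paying the
# per-stream overhead every time and finding no cross-row redundancy
record_style = sum(len(zlib.compress(r)) for r in records)

# BLOCK-style: many records compressed together in one stream
block_style = len(zlib.compress(b"".join(records)))

print(record_style, block_style)  # per-record total is far larger
```

On data like this the per-record total comes out several times larger than the single-stream size, which would be consistent with RECORD-mode output staying close to the raw CSV size.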