From: Tom Hall
Date: Mon, 9 May 2011 18:19:55 +0100
Subject: Re: Sequence File Compression
To: user@hive.apache.org

Anyone have an idea on this? Anyone using compression with sequence
files successfully?
The wiki and Hadoop: The Definitive Guide suggest that the settings
below are correct, so I am at a loss to explain what we are seeing.

Tom

On Fri, May 6, 2011 at 5:39 PM, Tom Hall wrote:
> I have read http://wiki.apache.org/hadoop/Hive/CompressedStorage and
> am trying to compress some of our tables. The only difference from the
> example is that we have partitioned our tables.
>
>
> SET io.seqfile.compression.type=BLOCK;
> SET hive.exec.compress.output=true;
> SET mapred.output.compress=true;
> insert overwrite table keywords_lzo partition (dated = '2010-01-25',
> client = 'TESTCLIENT') select
> account,campaign,ad_group,keyword_id,keyword,match_type,status,first_page_bid,quality_score,distribution,max_cpc,destination_url,ad_group_status,campaign_status,currency_code,impressions,clicks,ctr,cpc,cost,avg_position,account_id,campaign_id,adgroup_id
> from keywords where dated = '2010-01-25' and client = 'TESTCLIENT';
>
>
>
> The keywords_lzo table was created by running:
>
> CREATE TABLE `keywords_lzo` (
>  `account` STRING,
>  `campaign` STRING,
>  `ad_group` STRING,
>  `keyword_id` STRING,
>  `keyword` STRING,
>  `match_type` STRING,
>  `status` STRING,
>  `first_page_bid` STRING,
>  `quality_score` FLOAT,
>  `distribution` STRING,
>  `max_cpc` FLOAT,
>  `destination_url` STRING,
>  `ad_group_status` STRING,
>  `campaign_status` STRING,
>  `currency_code` STRING,
>  `impressions` INT,
>  `clicks` INT,
>  `ctr` FLOAT,
>  `cpc` STRING,
>  `cost` FLOAT,
>  `avg_position` FLOAT,
>  `account_id` STRING,
>  `campaign_id` STRING,
>  `adgroup_id` STRING
> )
> PARTITIONED BY (
> `dated` STRING,
> `client` STRING
> )
>
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
>
> STORED AS SEQUENCEFILE;
>
>
> The problem is that the output is 12 files totaling the same size as
> the input CSV (~750MB). With either LZO or GZIP I get the same
> behaviour. If I use TEXTFILE then I get the compression I would
> expect.
> > I can see in the head of the file > SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text =EF=BF= =BD'org.apache.hadoop.io.compress.GzipCodec > and it does look like garbage so something is happening but the total > size is not reduced from the CSV > > > Is the 12 output files significant? They are ~60M each and the blocksize = is 64M > I tried SET io.seqfile.compression.type=3DRECORD; also but still 12 > files and no reduction in size. > > > Thanks, > Tom >