From: Chalcy Raja
To: "user@hive.apache.org"
Subject: RE: hive - snappy and sequence file vs RC file
Date: Wed, 27 Jun 2012 23:01:39 +0000

Snappy vs LZO -
Implementing LZO takes several steps, starting with building the hadoop-lzo library. We finally got it built. Indexing had to be done as a separate step, and LZO indexing alters the way the files are stored, so we could not use Hadoop's built-in mapper. Snappy, on the other hand, comes packaged with Cloudera. Since we are using the Cloudera distribution, this makes sense for us. LZO compresses better than Snappy, but that was okay for us since performance is better with a Snappy sequence file than with LZO.

RC file vs sequence file - we would have gone with RC file for all the reasons given below, except that, as Bejoy said, sequence file is widely used. It looks like Sqoop may support sequence file with Hive import, and since we use Sqoop a lot, sequence file is the better choice.

We also tested going back and forth from one compression codec to another and from one file format to another. Since that is possible, we can switch the compression or file format later if we need to.

Thanks,
Chalcy

-----Original Message-----
From: yongqiang he [mailto:heyongqiangict@gmail.com]
Sent: Wednesday, June 27, 2012 12:41 AM
To: user@hive.apache.org
Subject: Re: hive - snappy and sequence file vs RC file

Can you share the reason of choosing snappy as your compression codec?
Like @omalley mentioned, RCFile will compress the data more densely, and will avoid reading data not required in your hive query. And I think Facebook use it to store tens of PB (if not hundred PB) of data.

Thanks
Yongqiang

On Tue, Jun 26, 2012 at 9:49 AM, Owen O'Malley wrote:
> SequenceFile compared to RCFile:
>   * More widely deployed.
>   * Available from MapReduce and Pig
>   * Doesn't compress as small (in RCFile all of each columns values are put together)
>   * Uncompresses and deserializes all of the columns, even if you are only reading a few
>
> In either case, for long term storage, you should seriously consider the default codec since that will provide much tighter compression (at the cost of cpu to compress it).
>
> -- Owen
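
The point quoted above, that RCFile puts all of each column's values together and so tends to compress smaller, can be tried out with a rough sketch. This is only an illustration under stated assumptions: the sample table is made up, zlib from the Python standard library stands in for whatever codec the cluster uses (it is not Snappy or LZO), and the two layouts are a simplification, not the real SequenceFile or RCFile formats. Actual sizes depend heavily on the data.

```python
import zlib

# Hypothetical sample table (user_id, country, status); not data from this thread.
rows = [
    (str(1000 + i), ["US", "GB", "IN"][i % 3], "active" if i % 5 else "closed")
    for i in range(300)
]


def row_layout(rows):
    """Serialize record by record, the way a row-oriented file stores data."""
    return "\n".join(",".join(r) for r in rows).encode("utf-8")


def column_layout(rows):
    """Serialize column by column, so each column's values sit next to each other."""
    columns = zip(*rows)  # transpose rows into columns
    return "\n".join(",".join(col) for col in columns).encode("utf-8")


def compressed_size(data):
    """Size in bytes after zlib compression (a stand-in codec for this sketch)."""
    return len(zlib.compress(data))


row_bytes = row_layout(rows)
col_bytes = column_layout(rows)

# Both layouts hold the same values and the same number of separators,
# so any difference after compression comes from the ordering alone.
print("row layout   :", len(row_bytes), "bytes ->", compressed_size(row_bytes))
print("column layout:", len(col_bytes), "bytes ->", compressed_size(col_bytes))
```

Running it on a sample of your own table, with your own codec swapped in for zlib, would give a better idea of how much the layout matters for your data.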