hive-user mailing list archives

From Chalcy Raja <Chalcy.R...@careerbuilder.com>
Subject RE: hive - snappy and sequence file vs RC file
Date Wed, 27 Jun 2012 23:01:39 GMT
Snappy vs LZO -
Implementing LZO takes several steps, starting with building the hadoop-lzo library, which we
eventually got built. Indexing had to be done as a separate step, and LZO indexing changes
the way the files are stored and therefore cannot use Hadoop's built-in mapper. Snappy, on the
other hand, comes packaged with Cloudera. Since we are using the Cloudera distribution, this
makes sense for us. LZO compresses better than Snappy, but that was acceptable for us because
performance is better with Snappy sequence files than with LZO.

RC file vs sequence file - We would have gone with RC file for all the reasons given below,
but, as Bejoy said, sequence file is widely used. It looks like Sqoop may support sequence
files with Hive import, and since we use Sqoop a lot, sequence file is the better choice.

We also tested going back and forth from one compression codec to another and from one file
format to another. Since that is possible, we can switch the compression or file format
later if we need to.

Thanks,
Chalcy

-----Original Message-----
From: yongqiang he [mailto:heyongqiangict@gmail.com] 
Sent: Wednesday, June 27, 2012 12:41 AM
To: user@hive.apache.org
Subject: Re: hive - snappy and sequence file vs RC file

Can you share the reason for choosing Snappy as your compression codec?
As @omalley mentioned, RCFile will compress the data more densely and will avoid reading
data not required by your Hive query. And I think Facebook uses it to store tens of PB (if
not hundreds of PB) of data.

Thanks
Yongqiang
On Tue, Jun 26, 2012 at 9:49 AM, Owen O'Malley <omalley@apache.org> wrote:
> SequenceFile compared to RCFile:
>   * More widely deployed.
>   * Available from MapReduce and Pig
>   * Doesn't compress as small (in RCFile all of each column's values
>     are put together)
>   * Uncompresses and deserializes all of the columns, even if you are
>     only reading a few
>
> In either case, for long-term storage, you should seriously consider
> the default codec since that will provide much tighter compression (at
> the cost of CPU to compress it).
>
> -- Owen
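
Owen's point that RCFile compresses smaller because each column's values are put together can be explored with a small experiment. The sketch below is only an illustration, not RCFile or SequenceFile code: the table, the column names, and every function name are made up, and Python's zlib stands in for a real Hadoop codec. It serializes the same rows once record by record and once column by column, then reports the compressed size of each layout.

```python
import random
import zlib


def make_rows(n, seed=0):
    """Build n synthetic (user_id, country, status) rows. The data is invented."""
    rng = random.Random(seed)
    countries = ["US", "DE", "IN", "BR"]
    statuses = ["active", "inactive"]
    return [
        (str(rng.randrange(10**9)), rng.choice(countries), rng.choice(statuses))
        for _ in range(n)
    ]


def row_major_bytes(rows):
    # Row layout: the fields of one record sit next to each other.
    return "\n".join(",".join(row) for row in rows).encode("utf-8")


def column_major_bytes(rows):
    # Column layout: all values of one column sit next to each other.
    columns = zip(*rows)
    return "\n".join(",".join(col) for col in columns).encode("utf-8")


def compressed_sizes(rows):
    """Return (row_layout_size, column_layout_size) in bytes after zlib."""
    row_size = len(zlib.compress(row_major_bytes(rows)))
    col_size = len(zlib.compress(column_major_bytes(rows)))
    return row_size, col_size


if __name__ == "__main__":
    rows = make_rows(10000)
    row_size, col_size = compressed_sizes(rows)
    print("row layout, compressed bytes:   ", row_size)
    print("column layout, compressed bytes:", col_size)
```

Both layouts hold exactly the same values and have the same uncompressed length, so any difference in the printed sizes comes from the ordering alone. How large that difference is depends on the data and the codec, so it is worth trying with a sample of your own tables.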

