hadoop-common-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject RE: some idea about the Data Compression
Date Tue, 02 Jul 2013 16:18:59 GMT

1.       These files will probably be in some standard format like .gz, .bz2, or .zip.  In that
case, pick an appropriate InputFormat.  See e.g. http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/

2.       Generally, compression is a Good Thing and will improve performance, but only if
you use a fast compressor like LZO or Snappy; gzip, ZIP, bz2, etc. are too slow for this.
You also need to ensure that your compressed files are "splittable" if you are going to create
a single file that will be processed by a later MR stage; a SequenceFile is helpful for this.
For typical intermediate outputs it doesn't matter as much, because you will have a folder
of file parts, and these are "pre-split" in some sense.  Once upon a time, LZO compression
had to be installed as a separate component, but I think the modern distros
include it.  See for example: http://kickstarthadoop.blogspot.com/2012/02/use-compression-with-mapreduce.html
, http://blog.cloudera.com/blog/2009/05/10-mapreduce-tips/, http://my.safaribooksonline.com/book/software-engineering-and-development/9781449328917/compression/id3689058,
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-4/compression (section
4.2 in the Elephant book).
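As a rough sketch of point 2, compression of the intermediate (map) output and of the final job output is controlled through job configuration.  The property names below are the Hadoop 2.x ones (older releases used the equivalent mapred.* names), and Snappy is just one choice of fast codec; this is a sketch, not a complete tuning guide:

```
<!-- mapred-site.xml, or set per-job on the Configuration object -->

<!-- Compress map output with a fast codec to cut shuffle I/O -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<!-- Compress the final job output.  If the output format is
     SequenceFileOutputFormat, BLOCK compression keeps the
     result splittable for a later MR stage. -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>
```

The same settings can be applied per job in a Java driver via conf.setBoolean(...) / conf.set(...) instead of editing the site file.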


From: Geelong Yao [mailto:geelongyao@gmail.com]
Sent: Thursday, June 20, 2013 12:30 AM
To: user@hadoop.apache.org
Subject: some idea about the Data Compression

Hi , everyone

I am working on the data compression
1.data compression before the raw data were uploaded into HDFS.
2.data compression while processing in Hadoop to reduce the pressure on IO.

Can anyone give me some ideas on above 2 directions


From Good To Great
