hadoop-common-user mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: Does Hadoop compress files?
Date Mon, 05 Apr 2010 04:55:49 GMT
See below.

On Sun, Apr 4, 2010 at 3:32 PM, u235sentinel <u235sentinel@gmail.com> wrote:
> Ok that's what I was thinking.  I was wondering if Hadoop did on the fly
> compression as it stored files in HDFS like Sensage does.  But it sounds
> like Hadoop will take a compressed file and store it as compressed which is
> fine by me.  Sensage will do the same.

That's correct.

> I believe this answers the question.  Sonal's link suggests there is support
> for compression using zlib, gzip and bzip2.
> One more question though.  So storing files in compressed format, any issues
> with searching that data?  I'm curious if there is a disadvantage in doing
> this.  I could build bigger and badder servers but was hoping for
> compression.

Just to be super specific about this: you can write data in any format
into HDFS. If you can turn it into Java primitives (including bytes),
you can write it to HDFS. The second half of the question is what your
options are for processing that data. If you plan on using Hadoop
MapReduce to process these files, you'll want to make sure you use a
compression format that Hadoop can "split" for parallel processing,
and only a subset of the formats are splittable. If you aren't
planning on using the MapReduce component of Hadoop, you can do
whatever you'd like. You can still write MapReduce jobs against
non-splittable compression formats, but Hadoop will not be able to
process a single file concurrently and instead will have to process an
entire file in one task. The best option here is to dig into the docs
a bit, figure out if what you want to do will be possible, and take
care of these details in the design phase.
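To make the splittability point concrete, here is a small illustrative sketch. This is not Hadoop's actual decision logic (which lives in the input format and codec classes); it is just a hand-written summary table of how the commonly discussed formats behave, and the class name is made up for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: summarizes which common compression formats can be
// split into parallel map tasks. Hadoop itself decides this via its
// codec and input-format classes, not a table like this.
public class SplittabilitySketch {
    static final Map<String, Boolean> SPLITTABLE =
            new LinkedHashMap<String, Boolean>();
    static {
        SPLITTABLE.put("gzip", false);  // one DEFLATE stream, no split points
        SPLITTABLE.put("zlib", false);  // same underlying stream format
        SPLITTABLE.put("bzip2", true);  // block markers; splittable in newer
                                        // Hadoop versions
        SPLITTABLE.put("lzo", true);    // splittable once an index is built
    }

    public static boolean isSplittable(String format) {
        Boolean s = SPLITTABLE.get(format);
        if (s == null) {
            throw new IllegalArgumentException("unknown format: " + format);
        }
        return s;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Boolean> e : SPLITTABLE.entrySet()) {
            System.out.println(e.getKey() + " splittable: " + e.getValue());
        }
    }
}
```

A non-splittable input still works; it just means one map task per file, so a single huge gzip file gives you no parallelism.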

> Thanks
> Eric Sammer wrote:
>> To clarify, there is no implicit compression in HDFS. In other words,
>> if you want your data to be compressed, you have to write it that way.
>> If you plan on writing map reduce jobs to process the compressed data,
>> you'll want to use a splittable compression format. This generally
>> means LZO or block compressed SequenceFiles which others have
>> mentioned.
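For reference, the block-compressed SequenceFile output mentioned above can be enabled through job configuration. A minimal sketch, assuming the 0.20-era mapred property names (check your release's docs, as these names have changed across versions, and the codec shown is just one possibility):

```xml
<!-- Compress job output as block-compressed SequenceFiles.
     Property names are from the 0.20-era mapred API and may differ
     in your release. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```

The job also needs to write SequenceFiles (e.g. via SequenceFileOutputFormat) for the BLOCK compression type to apply.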

Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
