Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <4BB91368.7000006@gmail.com>
References: <4BB77EA3.3030802@gmail.com>
	 <o2v23d54121004040146rce87c925l1762f60d7e01e4b0@mail.gmail.com>
	 <4BB91368.7000006@gmail.com>
Date: Sun, 4 Apr 2010 21:55:49 -0700
Message-ID: <x2z23d54121004042155nbf60f4a8t692da3cc5021c13a@mail.gmail.com>
Subject: Re: Does Hadoop compress files?
From: Eric Sammer <esammer@cloudera.com>
To: u235sentinel@gmail.com
Cc: common-user@hadoop.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

See below.

On Sun, Apr 4, 2010 at 3:32 PM, u235sentinel <u235sentinel@gmail.com> wrote:
> Ok that's what I was thinking. I was wondering if Hadoop did on the fly
> compression as it stored files in HDFS like Sensage does. But it sounds
> like Hadoop will take a compressed file and store it as compressed which is
> fine by me. Sensage will do that same.

That's correct.

> I believe this answers the question. Sonal's link suggests there is support
> for compression using zlib, gzip and bzip2.
> One more question though. So storing files in compressed format, any issues
> with searching that data? I'm curious if there is a disadvantage in doing
> this. I could build bigger and badder servers but was hoping for
> compression.

Just to be super specific about this, you can write data in any format
into HDFS. If you can turn it into Java primitives (including bytes),
you can write it to HDFS. The second half of the question is: what are
my options for processing this data? If you plan on using Hadoop map
reduce to process these files, you'll want to make sure you use a
compression format that Hadoop can "split" for parallel processing;
only a subset of the supported formats can be split. If you aren't
planning on using the MR component of Hadoop, you can do whatever you'd
like. You can still write map reduce jobs on non-splittable compression
formats, but Hadoop will not be able to process a single file
concurrently and instead will have to process each entire file in one
task. The best option here is to dig into the docs a bit, figure out
whether what you want to do is possible, and take care of these details
at the beginning.
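
To make the "split" point concrete, here is a toy Python sketch. It is
not Hadoop code: zlib stands in for a gzip-style codec, and the helper
names (compress_whole, compress_blocks, can_decode_from) are made up for
illustration. It only shows the general idea that a single compressed
stream cannot be picked up in the middle, while independently
compressed blocks can each be handed to a separate task.

```python
import zlib

def compress_whole(data: bytes) -> bytes:
    """Compress the whole file as one stream (the non-splittable case)."""
    return zlib.compress(data)

def compress_blocks(records, records_per_block=2):
    """Compress small groups of records independently of each other.

    This loosely mimics block compression: every block is a complete
    stream, so a reader can start at any block boundary.
    """
    blocks = []
    for i in range(0, len(records), records_per_block):
        chunk = b"".join(records[i:i + records_per_block])
        blocks.append(zlib.compress(chunk))
    return blocks

def can_decode_from(stream: bytes, offset: int) -> bool:
    """Report whether decompression can start at an arbitrary byte offset."""
    try:
        zlib.decompress(stream[offset:])
        return True
    except zlib.error:
        return False

records = [f"event {n}\n".encode() for n in range(8)]
whole = compress_whole(b"".join(records))
blocks = compress_blocks(records)

# A task handed only the second half of the single stream cannot decode it.
mid_ok = can_decode_from(whole, len(whole) // 2)

# Each independently compressed block decodes on its own, in any order.
decoded = [zlib.decompress(b) for b in blocks]
```

With the single stream, one task has to read the file from the first
byte; with the blocks, the work could be divided among several tasks,
which is the property you want from a format you plan to feed to MR.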

> Thanks
>
>
>
> Eric Sammer wrote:
>>
>> To clarify, there is no implicit compression in HDFS. In other words,
>> if you want your data to be compressed, you have to write it that way.
>> If you plan on writing map reduce jobs to process the compressed data,
>> you'll want to use a splittable compression format. This generally
>> means LZO or block compressed SequenceFiles which others have
>> mentioned.


-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com