Subject: Re: Does Hadoop compress files?
From: Eric Sammer
To: u235sentinel@gmail.com
Cc: common-user@hadoop.apache.org
Date: Sun, 4 Apr 2010 21:55:49 -0700

See below.

On Sun, Apr 4, 2010 at 3:32 PM, u235sentinel wrote:
> Ok that's what I was thinking. I was wondering if Hadoop did on the fly
> compression as it stored files in HDFS like Sensage does. But it sounds
> like Hadoop will take a compressed file and store it as compressed which is
> fine by me. Sensage will do the same.

That's correct.

> I believe this answers the question. Sonal's link suggests there is support
> for compression using zlib, gzip and bzip2.
> One more question though. So storing files in compressed format, any issues
> with searching that data? I'm curious if there is a disadvantage in doing
> this. I could build bigger and badder servers but was hoping for
> compression.

Just to be super specific about this, you can write data in any format
into HDFS. If you can turn it into Java primitives (including bytes),
you can write it to HDFS. The second half of the question is: what are
my options for processing this data? If you plan on using Hadoop
MapReduce to process these files, you'll want to make sure you use a
compression format that Hadoop can "split" for parallel processing.
Only a subset of the supported formats can be split. If you aren't
planning on using the MR component of Hadoop, you can do whatever you'd
like. You can still run MapReduce jobs over non-splittable compression
formats, but Hadoop will not be able to process a single file in
parallel and will instead have to process each entire file in one task.
The best option here is to dig into the docs a bit, figure out whether
what you want to do will be possible, and take care of these details in
the beginning.

> Thanks
>
> Eric Sammer wrote:
>>
>> To clarify, there is no implicit compression in HDFS. In other words,
>> if you want your data to be compressed, you have to write it that way.
>> If you plan on writing map reduce jobs to process the compressed data,
>> you'll want to use a splittable compression format.
>> This generally means LZO or block-compressed SequenceFiles, which others
>> have mentioned.

-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
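
As a rough illustration of the "splittable" point in the message above, here is a minimal sketch in plain Python. It does not use any Hadoop API; the helper names (compress_whole, compress_blocks, can_decode_from) and the 16 KB block size are assumptions made up for this example. It only shows the underlying idea: a single gzip stream cannot be picked up from the middle, while independently compressed blocks (the idea behind block-compressed container formats) can each be decoded on their own.

```python
import gzip
import zlib


def compress_whole(data: bytes) -> bytes:
    """Compress the whole payload as a single gzip stream."""
    return gzip.compress(data)


def compress_blocks(data: bytes, block_size: int) -> list:
    """Compress fixed-size chunks independently so each one decodes alone."""
    return [
        gzip.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def can_decode_from(stream: bytes, offset: int) -> bool:
    """Report whether decompression can start at the given byte offset."""
    try:
        gzip.decompress(stream[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header or usable decoder state at this position.
        return False


# Hypothetical sample data: 10,000 short text records.
records = b"".join(b"record %06d\n" % i for i in range(10000))

whole = compress_whole(records)
blocks = compress_blocks(records, block_size=16 * 1024)

# A reader handed the second half of the single stream cannot start there.
print("single stream, start at midpoint:", can_decode_from(whole, len(whole) // 2))

# A reader handed block 1 can decode it without seeing block 0.
second_block = gzip.decompress(blocks[1])
print("block 1 decodes alone, bytes:", len(second_block))
```

In this sketch each entry in blocks could be handed to a separate worker, whereas the single stream has to be read from the start by one worker, which mirrors the "entire file in one task" behaviour described above.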