From: "Segel, Mike"
To: "common-dev@hadoop.apache.org"
Date: Mon, 12 Jul 2010 13:42:09 -0500
Subject: RE: Hadoop Compression - Current Status

How can you say zip files are the 'best codecs' to use?

Call me silly, but I seem to recall that if you're using a zip'd file for input, you can't really use a file splitter?
(Going from memory, which isn't the best thing to do...)

-Mike

-----Original Message-----
From: Stephen Watt [mailto:swatt@us.ibm.com]
Sent: Monday, July 12, 2010 1:28 PM
To: common-dev@hadoop.apache.org
Subject: Hadoop Compression - Current Status

Please let me know if any of these assertions are incorrect. I'm going to be adding any feedback to the Hadoop Wiki.

It seems well documented that the LZO Codec is the most performant codec ( http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html ) but it is GPL infected and thus it is separately maintained here - http://github.com/kevinweil/hadoop-lzo

With regards to performance, and if you are not using sequential files, Gzip is the next best codec to use, followed by bzip2. Hadoop has supported processing bzip2 and gzip input formats for a while now, but it could never split the files, i.e., it assigned one mapper per file.
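
To illustrate the one-mapper-per-file point: a plain gzip stream can generally only be decoded starting from its first byte, so a reader handed the second half of the file has nowhere to begin. The sketch below is not Hadoop code and says nothing about how the patches mentioned in this thread work. It only uses Python's standard gzip and zlib modules, and the helper name can_decode_from and the sample records are made up for the illustration.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Return True if a gzip decoder can start reading `blob` at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header at the start.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        decoder.decompress(blob[offset:])
        return True
    except zlib.error:
        # No valid gzip header here, so decoding cannot begin.
        return False


# Hypothetical input: 20,000 small text records compressed as one gzip file.
records = b"".join(b"record %06d\n" % i for i in range(20000))
blob = gzip.compress(records)

# A reader starting at byte 0 succeeds.
from_start = can_decode_from(blob, 0)

# A reader starting at an arbitrary mid-file offset, which is what a
# second input split would have to do, fails.
from_middle = can_decode_from(blob, len(blob) // 2)

print("decode from start: ", from_start)
print("decode from middle:", from_middle)
```

With this input, from_start should come out True and from_middle False, which is why an unsplit gzip file has to go to a single mapper.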

There are now 2 new features:
- Splitting bzip2 files, available in 0.21.0 - https://issues.apache.org/jira/browse/HADOOP-4012
- Splitting gzip files (in progress, but a patch is available) - https://issues.apache.org/jira/browse/MAPREDUCE-491

1) It appears most folks are using LZO. Given that it is GPL, are you not worried about it virally infecting your project?
2) Is anyone using the new bzip2 or gzip file split compatible readers? How do you like them? General feedback?

Kind regards
Steve Watt