Subject: Re: Splittable lzo files
From: tim robertson
To: core-user@hadoop.apache.org
Date: Tue, 3 Mar 2009 09:43:46 +0100
Message-ID: <32120a6a0903030043o44808f36pace5b956e0535393@mail.gmail.com>
In-Reply-To: <49ACEB14.4080100@oskarsson.nu>

Thanks for posting this, Johan.

I tried unsuccessfully to handle GZip files, for the reasons you state, and resorted to uncompressed files. I will try the LZO format and post the performance difference between compressed and uncompressed input on EC2, which seems to have very slow disk I/O. We have seen really bad import speeds on PostGIS and MySQL with EC2 (worse than Mac minis, even with the largest instances), so I think this might be very applicable to EC2 users.

Cheers,
Tim

On Tue, Mar 3, 2009 at 9:32 AM, Johan Oskarsson wrote:
> Hi,
>
> thought I'd pass on this blog post I just wrote about how we compress our
> raw log data in Hadoop using Lzo at Last.fm.
>
> The essence of the post is that we're able to make them splittable by
> indexing where each compressed chunk starts in the file, similar to the gzip
> input format being worked on.
> This actually gives us a performance boost in certain jobs that read a lot
> of data while saving us disk space at the same time.
>
> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html
>
> /Johan
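
The indexing idea in the quoted message (record where each compressed chunk starts so that a reader can begin at any chunk boundary) can be sketched roughly as below. This is only an illustration under assumptions, not the Last.fm implementation: `zlib` from the Python standard library stands in for LZO, the 4-byte length prefix is a made-up container layout rather than the real lzop format, and `write_chunked` and `read_split` are hypothetical helper names.

```python
import io
import struct
import zlib


def write_chunked(records, chunk_size):
    """Compress records in independent chunks and return (data, index).

    index lists the byte offset at which each compressed chunk starts.
    Records are assumed not to contain newline bytes.
    """
    out = io.BytesIO()
    index = []
    for i in range(0, len(records), chunk_size):
        # zlib is a stand-in for LZO here (assumption for illustration).
        payload = zlib.compress(b"\n".join(records[i:i + chunk_size]))
        index.append(out.tell())
        # Hypothetical layout: 4-byte big-endian length, then the payload.
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)
    return out.getvalue(), index


def read_split(data, index, start, end):
    """Return the records of every chunk whose start offset is in [start, end)."""
    records = []
    for offset in index:
        if start <= offset < end:
            (length,) = struct.unpack_from(">I", data, offset)
            payload = data[offset + 4:offset + 4 + length]
            records.extend(zlib.decompress(payload).split(b"\n"))
    return records


# Two readers take arbitrary byte ranges of the same file. Each one
# processes only the chunks that begin inside its own range.
records = [b"line %d" % i for i in range(10)]
data, index = write_chunked(records, chunk_size=3)
mid = len(data) // 2
first_half = read_split(data, index, 0, mid)
second_half = read_split(data, index, mid, len(data))
```

Because every chunk is compressed on its own and a chunk belongs to whichever range contains its start offset, the two readers together see each record exactly once, which is the property a splittable input format needs.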