Message-ID: <4D39590A.6010404@apache.org>
Date: Fri, 21 Jan 2011 09:59:38 +0000
From: Steve Loughran
To: common-user@hadoop.apache.org
Subject: Re: Why Hadoop is slow in Cloud

On 21/01/11 09:20, Evert Lammerts wrote:
>> Even with performance hit, there are still benefits running Hadoop this
>> way
>>
>> -as you only consume/pay for CPU time you use, if you are only running
>> batch jobs, it's lower cost than having a hadoop cluster that is
>> under-used.
>>
>> -if your data is stored in the cloud infrastructure, then you need to
>> data mine it in VMs, unless you want to take the time and money hit of
>> moving it out, and have somewhere to store it.
>>
>> -if the infrastructure lets you, you can lock down the cluster so it is
>> secure.
>>
>> Where a physical cluster is good is that it is a very low cost way of
>> storing data, provided you can analyse it with Hadoop, and provided you
>> can keep that cluster busy most of the time, either with Hadoop work or
>> other scheduled work. If your cluster is idle for computation, you are
>> still paying the capital and (reduced) electricity costs, so the cost of
>> storage and what compute you do effectively increases.
>
> Agreed, but this has little to do with Hadoop as a middleware and more to do
> with the benefits of virtualized vs physical infrastructure. I agree that it
> is convenient to use HDFS as a DFS to keep your data local to your VMs, but
> you could choose other DFS's as well.

We don't use HDFS, we bring up VMs close to where the data persists.

http://www.slideshare.net/steve_l/high-availability-hadoop

>
> The major benefit of Hadoop is its data-locality principle, and this is what
> you give up when you move to the cloud. Regardless of whether you store your
> data within your VM or on a NAS, it *will* have to travel over a line. As
> soon as that happens you lose the benefit of data-locality and are left with
> MapReduce as a way for parallel computing. And in that case you could use
> less restrictive software, like maybe PBS. You could even install HOD on
> your virtual cluster, if you'd like the possibility of MapReduce.

We don't suffer locality hits so much, but you do pay for the extra
infrastructure costs of a more agile datacentre, and if you go to
redundancy in hardware over replication, you have fewer places to run
your code.

Even on EC2, which doesn't let you tell it what datasets you want to
play with for its VM placer to use in its decisions, once data is in the
datanodes you do get locality.

>
> Adarsh, there are probably results around of more generic benchmark tools
> (Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should
> give you a better idea of the penalties of virtualization. (Our experience
> with a number of technologies on our OpenNebula cloud is, like Steve points
> out, that you mainly pay for disk I/O performance.)

-would be interesting to see anything you can publish there...
>
> I think a decision to go with either cloud or physical infrastructure should
> be based on the frequency, intensity and types of computation you expect on
> the short term (that should include operations dealing with data), and the
> way you think these parameters will develop on a mid-long term. And then
> compare the prices of a physical cluster that meets those demands (make sure
> to include power and operations) and the investment you would otherwise need
> to make in Cloud.

+1
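
A back-of-envelope sketch of that comparison in Python, under a deliberately simplified model: the physical cluster is assumed to cost the same per hour whether busy or idle (so it ignores the reduced electricity when idle), the cloud is assumed to be billed only for the hours you actually use, and the function names and all figures are made-up placeholders, not numbers from this thread.

```python
def physical_cost_per_busy_hour(capex, lifetime_hours, power_ops_per_hour, utilization):
    """Effective cost of one hour of useful work on an owned cluster.

    capex              -- purchase price of the cluster (placeholder units)
    lifetime_hours     -- hours over which the capital cost is written off
    power_ops_per_hour -- power plus operations cost per wall-clock hour
    utilization        -- fraction of wall-clock hours the cluster is busy (0..1]
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Fixed cost is paid every hour, busy or idle (simplifying assumption).
    fixed_per_hour = capex / lifetime_hours + power_ops_per_hour
    # Idle hours are spread over the busy ones, so low utilization
    # raises the cost of each hour of real work.
    return fixed_per_hour / utilization


def break_even_utilization(capex, lifetime_hours, power_ops_per_hour, cloud_rate_per_hour):
    """Utilization above which the owned cluster is cheaper than the cloud.

    cloud_rate_per_hour -- price of an equivalent cloud cluster per hour used.
    A result above 1.0 means the owned cluster never wins under this model.
    """
    fixed_per_hour = capex / lifetime_hours + power_ops_per_hour
    return fixed_per_hour / cloud_rate_per_hour


if __name__ == "__main__":
    # Placeholder figures: three-year write-off, arbitrary currency units.
    capex, lifetime, power_ops, cloud_rate = 52560.0, 26280.0, 1.0, 5.0
    for u in (0.25, 0.5, 1.0):
        cost = physical_cost_per_busy_hour(capex, lifetime, power_ops, u)
        print(f"utilization {u:.2f}: {cost:.2f} per busy hour (cloud: {cloud_rate:.2f})")
    print("break-even utilization:",
          break_even_utilization(capex, lifetime, power_ops, cloud_rate))
```

The model leaves out data transfer and storage charges and the virtualization I/O penalty, any of which could shift the break-even point in either direction.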