Subject: RE: Persistent HDFS On EC2
From: Malcolm Matalka
To: core-user@hadoop.apache.org
Date: Wed, 11 Mar 2009 13:08:27 -0700
Haha, good to know I might be a guinea pig!

-----Original Message-----
From: Kris Jirapinyo [mailto:kris.jirapinyo@biz360.com]
Sent: Wednesday, March 11, 2009 15:59
To: core-user@hadoop.apache.org
Subject: Re: Persistent HDFS On EC2

That was also the starting point for my experiment (Tom White's article).
Note that the most painful part of this setup is probably writing and
testing the scripts that make it happen (and also customizing your EC2
images). It would be interesting to see someone else try it.

On Wed, Mar 11, 2009 at 12:04 PM, Adam Rose wrote:

> Tom White wrote a great blog post about some options here:
>
> http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html
>
> plus an Amazon article:
>
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873&categoryID=112
>
> Regards,
>
> - Adam
>
> Kris Jirapinyo wrote:
>
>> Why would you lose the locality of storage-per-machine if one EBS volume
>> is mounted to each machine instance? When that machine goes down, you can
>> just restart the instance and re-mount the exact same volume. I've tried
>> this idea before successfully on a 10-node cluster on EC2 and didn't see
>> any adverse performance effects--and Amazon actually claims that EBS I/O
>> should be even better than the instance stores. The only concern I see is
>> that you pay for EBS storage regardless of whether you use it.
>> So, if you have 10 EBS volumes of 1 TB each, and you're just starting
>> out with your cluster so you're only using 50 GB on each EBS volume so
>> far for the month, you'd still have to pay for 10 TB worth of EBS
>> volumes, and that could be a hefty price each month. Also, EBS volumes
>> currently need to be created in the same availability zone as your
>> instances, so you need to make sure they are created correctly, as
>> there is no direct migration of an EBS volume to a different
>> availability zone.
>>
>> On Wed, Mar 11, 2009 at 6:39 AM, Steve Loughran wrote:
>>
>>> Malcolm Matalka wrote:
>>>
>>>> If this is not the correct place to ask Hadoop + EC2 questions,
>>>> please let me know.
>>>>
>>>> I am trying to get a handle on how to use Hadoop on EC2 before
>>>> committing any money to it. My question is: how do I maintain a
>>>> persistent HDFS between restarts of instances? Most of the tutorials
>>>> I have found involve the cluster being wiped once all the instances
>>>> are shut down, but in my particular case I will be feeding the output
>>>> of the previous day's run as the input of the current day's run, and
>>>> this data will get large over time. I see I can use S3 as the file
>>>> system; would I just create an EBS volume for each instance? What are
>>>> my options?
>>>
>>> EBS would cost you more; you'd lose the locality of
>>> storage-per-machine.
>>>
>>> If you stick the output of some runs back into S3, then the next jobs
>>> have no locality and higher startup overhead to pull the data down,
>>> but you don't pay for that download (just the time it takes).
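And on Malcolm's "use S3 as the file system" option: in the Hadoop of this
era that means pointing fs.default.name at an S3 bucket in
hadoop-site.xml. A minimal sketch -- BUCKET and both keys are placeholders,
and note this is the s3:// block store (data stored as Hadoop blocks, not
readable as plain S3 objects; s3n:// is the native alternative):

```xml
<!-- hadoop-site.xml fragment: back the default filesystem with S3 so
     data survives cluster shutdown. BUCKET and keys are placeholders. -->
<property>
  <name>fs.default.name</name>
  <value>s3://BUCKET</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
```

The trade-off is exactly what Steve describes: the data persists across
cluster restarts, but jobs lose data locality and pay a startup cost to
pull inputs from S3.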
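For what it's worth, Kris's restart-and-re-mount workflow could be sketched
as a small boot-time script. This is only a hypothetical sketch, not the
actual scripts from the thread: `ec2-attach-volume` is the EC2 API
command-line tool of that era, but the volume ID, instance ID, device, and
mount point below are all placeholders, and the helpers just build the
command strings so you can inspect them before running anything for real.

```shell
#!/bin/sh
# Sketch: bring a datanode's dedicated EBS volume back after an instance
# restart. The functions emit the commands as strings rather than running
# them, so the sequence can be reviewed (or piped to sh) first.

# Command to re-attach the node's persistent volume to the new instance.
attach_command() {
  # $1 = EBS volume id, $2 = instance id, $3 = device name
  echo "ec2-attach-volume $1 -i $2 -d $3"
}

# Commands to mount the volume where dfs.data.dir points, then start the
# datanode -- the blocks written before the shutdown are still on the
# volume, so HDFS comes back with the same data.
remount_commands() {
  # $1 = device name, $2 = mount point backing dfs.data.dir
  echo "mount $1 $2"
  echo "hadoop-daemon.sh start datanode"
}

# Illustrative values only (placeholders, not real IDs):
attach_command vol-12345678 i-87654321 /dev/sdh
remount_commands /dev/sdh /mnt/hdfs-data
```

The per-node volume-to-instance mapping has to be kept somewhere outside
the cluster itself, which is the "painful scripting" part Kris mentions.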