From: Michael Segel
Reply-To: common-user@hadoop.apache.org
Subject: RE: recommendation on HDDs
Date: Sat, 12 Feb 2011 10:26:27 -0600

All,

I'd like to clarify some things...

First, the concept is to build out a cluster of commodity hardware.
So when you do your shopping you want to get the most bang for your buck. That is the 'sweet spot' that I'm talking about.

When you look at your E5500 or E5600 chips, you will want to go with 4 cores per CPU, dual CPU and a clock speed around 2.53GHz or so.
(Faster chips are more expensive and the performance edge falls off, so you end up paying a premium.)

Looking at your disks, you start with using the on board SATA controller. Why? Because it means you don't have to pay for a controller card.
If you are building a cluster for general purpose computing... Assuming 1U boxes, you have room for four 3.5" SATA drives, which still give you the best performance for your buck.
Can you go with 2.5"? Yes, but you are going to be paying a premium.

Price wise, a 2TB SATA II 7200 RPM drive is going to be your best deal. You could go with SATA III drives if your motherboard supports the SATA III ports, but you're still paying a slight premium.

The OP felt that all he would need was 1TB of disk and was considering four 250GB drives. (More spindles...yada yada yada...)

My suggestion is to forget that nonsense and go with one 2TB drive because it's a better deal and if you want to add more disk to the node, you can. (It's easier to add disk than it is to replace it.)

Now do you need to create a spare OS drive? No. Some people who have an internal 3.5" space sometimes do. That's ok, and you can put your hadoop logging there. (Just make sure you have a lot of disk space...)

The truth is that there really isn't any single *right* answer. There are a lot of options and budget constraints as well as physical constraints like power, space, and location of the hardware.
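To make the price argument concrete, here is a rough sketch of the arithmetic, using the Microcenter prices quoted further down in this thread ($110 for 2TB, $70 for 1TB, $45 for 250GB). The function and variable names are made up for illustration, and the numbers are only as good as those February 2011 prices. Note that the $180 figure in the quoted message matches four 250GB drives, i.e. 1TB; reaching 2TB with 250GB drives would take eight of them.

```python
# Sketch only: prices are the Feb 2011 Microcenter figures quoted in this thread.
# All names here are hypothetical, chosen for illustration.
PRICES_USD = {2000: 110.00, 1000: 70.00, 250: 45.00}  # drive size in GB -> price


def cost_to_reach(target_gb, drive_gb, drive_price):
    """Return (drive_count, total_cost) needed to hold at least target_gb."""
    count = -(-target_gb // drive_gb)  # ceiling division on integers
    return count, count * drive_price


def price_per_gb(drive_gb, drive_price):
    """Return the price in USD per GB for a single drive."""
    return drive_price / drive_gb


if __name__ == "__main__":
    for target_gb in (1000, 2000):
        print(f"Target capacity: {target_gb} GB")
        for drive_gb, price in sorted(PRICES_USD.items(), reverse=True):
            count, total = cost_to_reach(target_gb, drive_gb, price)
            print(
                f"  {drive_gb:>4} GB drives: {count} x ${price:.2f} = ${total:.2f} "
                f"(${price_per_gb(drive_gb, price):.3f}/GB)"
            )
```

The drive count also matters beyond price: every extra drive uses a SATA port and a slot, which is the "room to grow" point made in the thread.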
Also you may be building out a cluster whose main purpose is to be a backup location for your cluster. So your production cluster has lots of nodes. Your backup cluster has lots of disks per node because your main focus is as much storage per node as possible.

So here you may end up buying a 4U rack box, load it up with 3.5" drives and a couple of SATA controller cards. You care less about performance but more about storage space. Here you may say 3TB SATA drives with 12 or more per box. (I don't know how many you can fit into a 4U chassis these days.) So you have 10 DN backing up a 100+ DN cluster in your main data center. But that's another story.

I think the main takeaway you should have is that if you look at the price point... your best price per GB is on a 2TB drive until the prices drop on 3TB drives.

Since the OP believes that their requirement is 1TB per node... a single 2TB would be the best choice. It allows for additional space and you really shouldn't be too worried about disk i/o being your bottleneck.

HTH

-Mike

> Date: Sat, 12 Feb 2011 10:42:50 -0500
> Subject: Re: recommendation on HDDs
> From: edlinuxguru@gmail.com
> To: common-user@hadoop.apache.org
>
> On Fri, Feb 11, 2011 at 7:14 PM, Ted Dunning wrote:
> > Bandwidth is definitely better with more active spindles. I would recommend
> > several larger disks. The cost is very nearly the same.
> >
> > On Fri, Feb 11, 2011 at 3:52 PM, Shrinivas Joshi wrote:
> >
> >> Thanks for your inputs, Michael. We have 6 open SATA ports on the
> >> motherboards. That is the reason why we are thinking of 4 to 5 data disks
> >> and 1 OS disk.
> >> Are you suggesting use of one 2TB disk instead of four 500GB disks, let's
> >> say?
> >> I thought that the HDFS utilization/throughput increases with the # of
> >> disks per node (assuming that the total usable IO bandwidth increases
> >> proportionally).
> >>
> >> -Shrinivas
> >>
> >> On Thu, Feb 10, 2011 at 4:25 PM, Michael Segel wrote:
> >>
> >> >
> >> > Shrinivas,
> >> >
> >> > Assuming you're in the US, I'd recommend the following:
> >> >
> >> > Go with 2TB 7200 SATA hard drives.
> >> > (Not sure what type of hardware you have)
> >> >
> >> > What we've found is that in the data nodes, there's an optimal
> >> > configuration that balances price versus performance.
> >> >
> >> > While your chassis may hold 8 drives, how many open SATA ports are on the
> >> > motherboard? Since you're using JBOD, you don't want the additional expense
> >> > of having to purchase a separate controller card for the additional drives.
> >> >
> >> > I'm running Seagate drives at home and I haven't had any problems for
> >> > years.
> >> > When you look at your drive, you need to know total storage, speed (rpms),
> >> > and cache size.
> >> > Looking at Microcenter's pricing... 2TB 3.0Gb/s SATA Hitachi was $110.00
> >> > A 1TB Seagate was $70.00
> >> > A 250GB SATA drive was $45.00
> >> >
> >> > So 2TB = 110, 140, 180 (respectively)
> >> >
> >> > So you get a better deal on 2TB.
> >> >
> >> > So if you go out and get more drives but of lower density, you'll end up
> >> > spending more money and use more energy, but I doubt you'll see a real
> >> > performance difference.
> >> >
> >> > The other thing is that if you want to add more disk, you have room to
> >> > grow. (Just add more disk and restart the node, right?)
> >> > If all of your disk slots are filled, you're SOL. You have to take out the
> >> > box, replace all of the drives, then add to cluster as 'new' node.
> >> >
> >> > Just my $0.02 cents.
> >> >
> >> > HTH
> >> >
> >> > -Mike
> >> >
> >> > > Date: Thu, 10 Feb 2011 15:47:16 -0600
> >> > > Subject: Re: recommendation on HDDs
> >> > > From: jshrinivas@gmail.com
> >> > > To: common-user@hadoop.apache.org
> >> > >
> >> > > Hi Ted, Chris,
> >> > >
> >> > > Much appreciate your quick reply. The reason why we are looking for smaller
> >> > > capacity drives is because we are not anticipating a huge growth in data
> >> > > footprint and also read somewhere that the larger the capacity of the drive,
> >> > > the bigger the number of platters in it, and that could affect drive
> >> > > performance. But looks like you can get 1TB drives with only 2 platters.
> >> > > Large capacity drives should be OK for us as long as they perform equally
> >> > > well.
> >> > >
> >> > > Also, the systems that we have can host up to 8 SATA drives in them. In that
> >> > > case, would backplanes offer additional advantages?
> >> > >
> >> > > Any suggestions on 5400 vs. 7200 vs. 10000 RPM disks? I guess 10K rpm disks
> >> > > would be overkill comparing their perf/cost advantage?
> >> > >
> >> > > Thanks for your inputs.
> >> > >
> >> > > -Shrinivas
> >> > >
> >> > > On Thu, Feb 10, 2011 at 2:48 PM, Chris Collins <chris_j_collins@yahoo.com> wrote:
> >> > >
> >> > > > Of late we have had serious issues with Seagate drives in our Hadoop
> >> > > > cluster. These were purchased over several purchasing cycles and we are
> >> > > > pretty sure it wasn't just a single "bad batch". Because of this we
> >> > > > switched to buying 2TB Hitachi drives, which seem to have been
> >> > > > considerably more reliable.
> >> > > >
> >> > > > Best
> >> > > >
> >> > > > C
> >> > > > On Feb 10, 2011, at 12:43 PM, Ted Dunning wrote:
> >> > > >
> >> > > > > Get bigger disks. Data only grows and having extra is always good.
> >> > > > >
> >> > > > > You can get 2TB drives for <$100 and 1TB for < $75.
> >> > > > >
> >> > > > > As far as transfer rates are concerned, any 3Gb/s SATA drive is going to
> >> > > > > be about the same (ish). Seek times will vary a bit with rotation speed,
> >> > > > > but with Hadoop, you will be doing long reads and writes.
> >> > > > >
> >> > > > > Your controller and backplane will have a MUCH bigger vote in getting
> >> > > > > acceptable performance. With only 4 or 5 drives, you don't have to worry
> >> > > > > about a super-duper backplane, but you can still kill performance with a
> >> > > > > lousy controller.
> >> > > > >
> >> > > > > On Thu, Feb 10, 2011 at 12:26 PM, Shrinivas Joshi <jshrinivas@gmail.com> wrote:
> >> > > > >
> >> > > > >> What would be a good hard drive for a 7 node cluster which is targeted to
> >> > > > >> run a mix of IO and CPU intensive Hadoop workloads? We are looking for
> >> > > > >> around 1 TB of storage on each node distributed amongst 4 or 5 disks. So
> >> > > > >> either 250GB * 4 disks or 160GB * 5 disks. Also it should be less than
> >> > > > >> $100 each ;)
> >> > > > >>
> >> > > > >> I looked at HDD benchmark comparisons on tomshardware, storagereview etc.
> >> > > > >> Got overwhelmed with the # of benchmarks and different aspects of HDD
> >> > > > >> performance.
> >> > > > >>
> >> > > > >> Appreciate your help on this.
> >> > > > >>
> >> > > > >> -Shrinivas
>
> You also do not need a dedicated OS disk. I typically slice off
> partitions on some of the disks and do a software mirror there. This
> gives you redundancy without having to sacrifice one or two disk slots
> with smaller disks.