hadoop-common-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: recommendation on HDDs
Date Sat, 12 Feb 2011 16:26:27 GMT

All, 

I'd like to clarify some things...

First, the concept is to build out a cluster of commodity hardware.
So when you do your shopping you want to get the most bang for your buck. That is the 'sweet
spot' that I'm talking about.
When you look at your E5500 or E5600 series chips, you will want to go with 4 cores per CPU,
dual CPU and a clock speed around 2.53GHz or so.
(Faster chips are more expensive and the performance edge falls off so you end up paying a
premium.)

Looking at your disks, you start by using the onboard SATA controller. Why? Because it
means you don't have to pay for a controller card.
If you are building a cluster for general-purpose computing and assuming 1U boxes, you have
room for four 3.5" SATA drives, which still give you the best performance for your buck.
Can you go with 2.5"? Yes, but you are going to be paying a premium.

Price-wise, a 2TB SATA II 7200 RPM drive is going to be your best deal. You could go with
SATA III drives if your motherboard supports the SATA III ports, but you're still paying a
slight premium.

The OP felt that all he would need was 1TB of disk and was considering 4 250GB drives. (More
spindles...yada yada yada...)

My suggestion is to forget that nonsense and go with one 2TB drive because it's a better deal
and if you want to add more disk to the node, you can. (It's easier to add disk than it is
to replace it.)

Now do you need to create a separate OS drive? No. Some people who have an internal 3.5" bay
do. That's ok, and you can put your hadoop logging there. (Just make sure you have
a lot of disk space...)

The truth is that there really isn't any single *right* answer. There are a lot of options
and budget constraints as well as physical constraints like power, space, and location of
the hardware.

Also you may be building out a cluster whose main purpose is to be a backup location for your
cluster. So your production cluster has lots of nodes. Your backup cluster has lots of disks
per node because your main focus is as much storage per node as possible.

So here you may end up buying a 4U rack box, load it up with 3.5" drives and a couple of SATA
controller cards. You care less about performance and more about storage space. Here you may
go with 3TB SATA drives, with 12 or more per box. (I don't know how many you can fit into a 4U
chassis these days.) So you have 10 DN backing up a 100+ DN cluster in your main data center.
But that's another story.
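
As a rough back-of-the-envelope sketch of that backup-cluster sizing (not a
sizing tool): the 10 nodes, 12 drives per box and 3TB drives are the numbers
above. The replication factor of 3 is only the HDFS default (dfs.replication)
and is an assumption here; a backup cluster may well run with less. Space
reserved for the OS, logs and other non-DFS use is ignored.

```python
def cluster_capacity_tb(nodes, drives_per_node, drive_tb, replication):
    """Return (raw TB, usable TB) for a cluster of identical data nodes.

    Usable capacity is raw capacity divided by the replication factor.
    Non-DFS overhead (OS, logs, temp space) is deliberately ignored.
    """
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb, raw_tb / replication


# 10 data nodes, 12 x 3TB drives each (figures from the text above).
# replication=3 is an assumption (the HDFS default), not a recommendation.
raw_tb, usable_tb = cluster_capacity_tb(nodes=10, drives_per_node=12,
                                        drive_tb=3, replication=3)
print(f"raw: {raw_tb} TB, usable at 3x replication: {usable_tb:.0f} TB")
```

Changing the replication argument shows how much the usable figure moves,
which is usually the first knob people look at for a backup-only cluster.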

I think the main takeaway you should have is that if you look at the price point... your
best price per GB is on a 2TB drive until the prices drop on 3TB drives.
Since the OP believes that their requirement is 1TB per node... a single 2TB would be the
best choice. It allows for additional space and you really shouldn't be too worried about
disk i/o being your bottleneck.
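
To make the price-per-GB point concrete, here is a small sketch using the
Microcenter prices quoted further down in this thread (2TB at $110, 1TB at
$70, 250GB at $45). The prices are from February 2011 and the 2TB comparison
target is just for illustration; the function names are made up for this
example.

```python
# Prices quoted in this thread (Microcenter, Feb 2011), in US dollars.
# Each entry: (label, capacity in GB, price per drive).
DRIVES = [
    ("2TB Hitachi", 2000, 110.00),
    ("1TB Seagate", 1000, 70.00),
    ("250GB SATA", 250, 45.00),
]

TARGET_GB = 2000  # compare the cost of building 2TB out of each drive size


def cost_to_reach(target_gb, capacity_gb, price):
    """Return (drive count, total cost) needed to reach target_gb."""
    count = -(-target_gb // capacity_gb)  # ceiling division on integers
    return count, count * price


def price_per_gb(capacity_gb, price):
    """Return the price of a single drive divided by its capacity in GB."""
    return price / capacity_gb


for label, capacity_gb, price in DRIVES:
    count, total = cost_to_reach(TARGET_GB, capacity_gb, price)
    print(f"{label}: {count} drive(s), ${total:.2f} total, "
          f"${price_per_gb(capacity_gb, price):.3f}/GB")
```

Setting TARGET_GB to 1000 gives the OP's case: four 250GB drives come to
$180, while a single 2TB drive is $110 and leaves room to grow.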

HTH

-Mike


> Date: Sat, 12 Feb 2011 10:42:50 -0500
> Subject: Re: recommendation on HDDs
> From: edlinuxguru@gmail.com
> To: common-user@hadoop.apache.org
> 
> On Fri, Feb 11, 2011 at 7:14 PM, Ted Dunning <tdunning@maprtech.com> wrote:
> > Bandwidth is definitely better with more active spindles.  I would recommend
> > several larger disks.  The cost is very nearly the same.
> >
> > On Fri, Feb 11, 2011 at 3:52 PM, Shrinivas Joshi <jshrinivas@gmail.com>wrote:
> >
> >> Thanks for your inputs, Michael.  We have 6 open SATA ports on the
> >> motherboards. That is the reason why we are thinking of 4 to 5 data disks
> >> and 1 OS disk.
> >> Are you suggesting use of one 2TB disk instead of four 500GB disks, let's
> >> say?
> >> I thought that the HDFS utilization/throughput increases with the # of
> >> disks per node (assuming that the total usable IO bandwidth increases
> >> proportionally).
> >>
> >> -Shrinivas
> >>
> >> On Thu, Feb 10, 2011 at 4:25 PM, Michael Segel <michael_segel@hotmail.com> wrote:
> >>
> >> >
> >> > Shrinivas,
> >> >
> >> > Assuming you're in the US, I'd recommend the following:
> >> >
> >> > Go with 2TB 7200 SATA hard drives.
> >> > (Not sure what type of hardware you have)
> >> >
> >> > What  we've found is that in the data nodes, there's an optimal
> >> > configuration that balances price versus performance.
> >> >
> >> > While your chassis may hold 8 drives, how many open SATA ports are on the
> >> > motherboard? Since you're using JBOD, you don't want the additional
> >> > expense of having to purchase a separate controller card for the
> >> > additional drives.
> >> >
> >> > I'm running Seagate drives at home and I haven't had any problems for
> >> > years.
> >> > When you look at your drive, you need to know total storage, speed
> >> > (rpms), and cache size.
> >> > Looking at Microcenter's pricing...
> >> > A 2TB 3.0Gb/s SATA Hitachi was $110.00
> >> > A 1TB Seagate was $70.00
> >> > A 250GB SATA drive was $45.00
> >> >
> >> > So 2TB = $110, $140, $360 (respectively)
> >> >
> >> > So you get a better deal on 2TB.
> >> >
> >> > So if you go out and get more drives but of lower density, you'll end up
> >> > spending more money and use more energy, but I doubt you'll see a real
> >> > performance difference.
> >> >
> >> > The other thing is that if you want to add more disk, you have room to
> >> > grow. (Just add more disk and restart the node, right?)
> >> > If all of your disk slots are filled, you're SOL. You have to take out
> >> > the box, replace all of the drives, then add to cluster as 'new' node.
> >> >
> >> > Just my $0.02 cents.
> >> >
> >> > HTH
> >> >
> >> > -Mike
> >> >
> >> > > Date: Thu, 10 Feb 2011 15:47:16 -0600
> >> > > Subject: Re: recommendation on HDDs
> >> > > From: jshrinivas@gmail.com
> >> > > To: common-user@hadoop.apache.org
> >> > >
> >> > > Hi Ted, Chris,
> >> > >
> >> > > Much appreciate your quick reply. The reason why we are looking for
> >> > > smaller capacity drives is because we are not anticipating a huge
> >> > > growth in data footprint and also read somewhere that the larger the
> >> > > capacity of the drive, the bigger the number of platters in it, and
> >> > > that could affect drive performance. But it looks like you can get 1TB
> >> > > drives with only 2 platters. Large capacity drives should be OK for us
> >> > > as long as they perform equally well.
> >> > >
> >> > > Also, the systems that we have can host up to 8 SATA drives in them. In
> >> > > that case, would backplanes offer additional advantages?
> >> > >
> >> > > Any suggestions on 5400 vs. 7200 vs. 10000 RPM disks?  I guess 10K rpm
> >> > > disks would be overkill comparing their perf/cost advantage?
> >> > >
> >> > > Thanks for your inputs.
> >> > >
> >> > > -Shrinivas
> >> > >
> >> > > On Thu, Feb 10, 2011 at 2:48 PM, Chris Collins <chris_j_collins@yahoo.com> wrote:
> >> > >
> >> > > > Of late we have had serious issues with Seagate drives in our Hadoop
> >> > > > cluster.  These were purchased over several purchasing cycles and we
> >> > > > are pretty sure it wasn't just a single "bad batch".  Because of this
> >> > > > we switched to buying 2TB Hitachi drives, which seem to have been
> >> > > > considerably more reliable.
> >> > > >
> >> > > > Best
> >> > > >
> >> > > > C
> >> > > > On Feb 10, 2011, at 12:43 PM, Ted Dunning wrote:
> >> > > >
> >> > > > > Get bigger disks.  Data only grows and having extra is always good.
> >> > > > >
> >> > > > > You can get 2TB drives for <$100 and 1TB for <$75.
> >> > > > >
> >> > > > > As far as transfer rates are concerned, any 3Gb/s SATA drive is going
> >> > > > > to be about the same (ish).  Seek times will vary a bit with rotation
> >> > > > > speed, but with Hadoop, you will be doing long reads and writes.
> >> > > > >
> >> > > > > Your controller and backplane will have a MUCH bigger vote in getting
> >> > > > > acceptable performance.  With only 4 or 5 drives, you don't have to
> >> > > > > worry about a super-duper backplane, but you can still kill
> >> > > > > performance with a lousy controller.
> >> > > > >
> >> > > > > On Thu, Feb 10, 2011 at 12:26 PM, Shrinivas Joshi <jshrinivas@gmail.com> wrote:
> >> > > > >
> >> > > > >> What would be a good hard drive for a 7 node cluster which is
> >> > > > >> targeted to run a mix of IO and CPU intensive Hadoop workloads? We
> >> > > > >> are looking for around 1 TB of storage on each node distributed
> >> > > > >> amongst 4 or 5 disks. So either 250GB * 4 disks or 160GB * 5 disks.
> >> > > > >> Also it should be less than $100 each ;)
> >> > > > >>
> >> > > > >> I looked at HDD benchmark comparisons on tomshardware, storagereview
> >> > > > >> etc. Got overwhelmed with the # of benchmarks and different aspects
> >> > > > >> of HDD performance.
> >> > > > >>
> >> > > > >> Appreciate your help on this.
> >> > > > >>
> >> > > > >> -Shrinivas
> >> > > > >>
> >> > > >
> >> > > >
> >> > > >
> >> >
> >> >
> >>
> >
> 
> You also do not need a dedicated OS disk. I typically carve out
> partitions on some of the disks and do a software mirror there. This
> gives you redundancy without having to sacrifice one or two disk slots
> to smaller disks.
 		 	   		  