hadoop-common-user mailing list archives

From James Seigel <ja...@tynt.com>
Subject Re: recommendation on HDDs
Date Sat, 12 Feb 2011 18:36:36 GMT
The only concern is that HDFS doesn't seem to handle disks of different
sizes particularly well in practice.

James

Sent from my mobile. Please excuse the typos.
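
To illustrate the mixed-disk-size concern above, here is a minimal sketch. It assumes the DataNode hands blocks to its data directories in simple round-robin order, skipping full ones, which is believed to be how Hadoop releases of this era behaved; treat that as an assumption rather than a statement about any specific version. The function name, the 64MB block size and the disk sizes are illustrative only.

```python
def simulate_round_robin(capacities_gb, data_gb, block_mb=64):
    """Return the fraction of each disk used after writing data_gb.

    Blocks are handed to disks in turn, skipping any disk that is full.
    This is a toy model, not the actual DataNode code.
    """
    capacity_blocks = [gb * 1024 // block_mb for gb in capacities_gb]
    used_blocks = [0] * len(capacities_gb)
    blocks_to_write = data_gb * 1024 // block_mb
    disk = 0
    for _ in range(blocks_to_write):
        # Advance to the next disk that still has room for one block.
        for _attempt in range(len(used_blocks)):
            if used_blocks[disk] < capacity_blocks[disk]:
                break
            disk = (disk + 1) % len(used_blocks)
        else:
            raise RuntimeError("all disks are full")
        used_blocks[disk] += 1
        disk = (disk + 1) % len(used_blocks)
    return [used / cap for used, cap in zip(used_blocks, capacity_blocks)]


if __name__ == "__main__":
    sizes = [2000, 500, 500, 500]  # one 2TB disk next to three 500GB disks
    for size, frac in zip(sizes, simulate_round_robin(sizes, data_gb=1600)):
        print(f"{size}GB disk: {frac:.0%} full")
```

Under this model every disk receives the same amount of data, so after 1600GB the 2TB disk is 20% full while each 500GB disk is already 80% full. Once the small disks fill up, all further writes land on the one large disk, which is one plausible reason mixed sizes behave poorly.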

On 2011-02-12, at 8:43 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

> On Fri, Feb 11, 2011 at 7:14 PM, Ted Dunning <tdunning@maprtech.com> wrote:
>> Bandwidth is definitely better with more active spindles.  I would recommend
>> several larger disks.  The cost is very nearly the same.
>>
>> On Fri, Feb 11, 2011 at 3:52 PM, Shrinivas Joshi <jshrinivas@gmail.com> wrote:
>>
>>> Thanks for your inputs, Michael.  We have 6 open SATA ports on the
>>> motherboards. That is the reason why we are thinking of 4 to 5 data disks
>>> and 1 OS disk.
>>> Are you suggesting the use of one 2TB disk instead of, let's say, four
>>> 500GB disks?
>>> I thought that HDFS utilization/throughput increases with the number of
>>> disks per node (assuming that the total usable IO bandwidth increases
>>> proportionally).
>>>
>>> -Shrinivas
>>>
>>> On Thu, Feb 10, 2011 at 4:25 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>
>>>>
>>>> Shrinivas,
>>>>
>>>> Assuming you're in the US, I'd recommend the following:
>>>>
>>>> Go with 2TB 7200 SATA hard drives.
>>>> (Not sure what type of hardware you have)
>>>>
>>>> What we've found is that in the data nodes, there's an optimal
>>>> configuration that balances price versus performance.
>>>>
>>>> While your chassis may hold 8 drives, how many open SATA ports are on the
>>>> motherboard? Since you're using JBOD, you don't want the additional
>>>> expense of having to purchase a separate controller card for the
>>>> additional drives.
>>>>
>>>> I'm running Seagate drives at home and I haven't had any problems for
>>>> years.
>>>> When you look at your drive, you need to know total storage, speed
>>>> (RPM), and cache size.
>>>> Looking at Microcenter's pricing:
>>>> A 2TB 3.0Gb/s SATA Hitachi was $110.00
>>>> A 1TB Seagate was $70.00
>>>> A 250GB SATA drive was $45.00
>>>>
>>>> So 2TB = 110, 140, 180 (respectively)
>>>>
>>>> So you get a better deal on 2TB.
>>>>
>>>> So if you go out and get more drives but of lower density, you'll end up
>>>> spending more money and use more energy, but I doubt you'll see a real
>>>> performance difference.
>>>>
>>>> The other thing is that if you want to add more disks, you have room to
>>>> grow. (Just add more disks and restart the node, right?)
>>>> If all of your disk slots are filled, you're SOL. You have to take out
>>>> the box, replace all of the drives, then add it to the cluster as a
>>>> 'new' node.
>>>>
>>>> Just my $0.02.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>>> Date: Thu, 10 Feb 2011 15:47:16 -0600
>>>>> Subject: Re: recommendation on HDDs
>>>>> From: jshrinivas@gmail.com
>>>>> To: common-user@hadoop.apache.org
>>>>>
>>>>> Hi Ted, Chris,
>>>>>
>>>>> Much appreciate your quick reply. The reason why we are looking for
>>>>> smaller capacity drives is that we are not anticipating huge growth
>>>>> in data footprint, and we also read somewhere that the larger the
>>>>> capacity of the drive, the greater the number of platters in it, which
>>>>> could affect drive performance. But it looks like you can get 1TB
>>>>> drives with only 2 platters. Large capacity drives should be OK for us
>>>>> as long as they perform equally well.
>>>>>
>>>>> Also, the systems that we have can host up to 8 SATA drives in them.
>>>>> In that case, would backplanes offer additional advantages?
>>>>>
>>>>> Any suggestions on 5400 vs. 7200 vs. 10000 RPM disks? I guess 10K RPM
>>>>> disks would be overkill considering their perf/cost advantage?
>>>>>
>>>>> Thanks for your inputs.
>>>>>
>>>>> -Shrinivas
>>>>>
>>>>> On Thu, Feb 10, 2011 at 2:48 PM, Chris Collins <chris_j_collins@yahoo.com> wrote:
>>>>>
>>>>>> Of late we have had serious issues with Seagate drives in our Hadoop
>>>>>> cluster. These were purchased over several purchasing cycles, and we
>>>>>> are pretty sure it wasn't just a single "bad batch". Because of this
>>>>>> we switched to buying 2TB Hitachi drives, which seem to have been
>>>>>> considerably more reliable.
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> C
>>>>>> On Feb 10, 2011, at 12:43 PM, Ted Dunning wrote:
>>>>>>
>>>>>>> Get bigger disks.  Data only grows and having extra is always good.
>>>>>>>
>>>>>>> You can get 2TB drives for <$100 and 1TB for < $75.
>>>>>>>
>>>>>>> As far as transfer rates are concerned, any 3Gb/s SATA drive is going
>>>>>>> to be about the same (ish). Seek times will vary a bit with rotation
>>>>>>> speed, but with Hadoop, you will be doing long reads and writes.
>>>>>>>
>>>>>>> Your controller and backplane will have a MUCH bigger vote in getting
>>>>>>> acceptable performance. With only 4 or 5 drives, you don't have to
>>>>>>> worry about a super-duper backplane, but you can still kill
>>>>>>> performance with a lousy controller.
>>>>>>>
>>>>>>> On Thu, Feb 10, 2011 at 12:26 PM, Shrinivas Joshi <jshrinivas@gmail.com> wrote:
>>>>>>>
>>>>>>>> What would be a good hard drive for a 7 node cluster which is
>>>>>>>> targeted to run a mix of IO and CPU intensive Hadoop workloads? We
>>>>>>>> are looking for around 1 TB of storage on each node distributed
>>>>>>>> amongst 4 or 5 disks. So either 250GB * 4 disks or 160GB * 5 disks.
>>>>>>>> Also it should be less than $100 each ;)
>>>>>>>>
>>>>>>>> I looked at HDD benchmark comparisons on tomshardware, storagereview
>>>>>>>> etc. Got overwhelmed with the # of benchmarks and different aspects
>>>>>>>> of HDD performance.
>>>>>>>>
>>>>>>>> Appreciate your help on this.
>>>>>>>>
>>>>>>>> -Shrinivas
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>
> You also do not need a dedicated OS disk. I typically slice off
> partitions on some of the disks and do a software mirror there. This
> gives you redundancy without having to sacrifice one or two disk slots
> to smaller disks.
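
The price comparison quoted earlier in the thread can be checked with a few lines of arithmetic. This is a rough sketch using the Microcenter prices as quoted (February 2011) and assuming 2TB = 2000GB; the function and variable names are hypothetical.

```python
import math

# Capacity in GB and unit price in USD, as quoted in the thread.
DRIVES = {
    "2TB Hitachi": (2000, 110.00),
    "1TB Seagate": (1000, 70.00),
    "250GB SATA": (250, 45.00),
}


def cost_for_capacity(target_gb, drive_gb, unit_price):
    """Return (drive_count, total_cost) needed to reach at least target_gb."""
    count = math.ceil(target_gb / drive_gb)
    return count, count * unit_price


if __name__ == "__main__":
    for name, (drive_gb, price) in DRIVES.items():
        count, total = cost_for_capacity(2000, drive_gb, price)
        print(f"{name}: {count} drive(s), ${total:.2f}, {count} SATA port(s)")
```

With the capacities as quoted, 2TB works out to 1 drive at $110, 2 drives at $140, or 8 drives at $360. The $180 figure in the thread matches four drives at $45, which reaches 2TB only if those drives are 500GB each. Either way the single 2TB drive is the cheapest per TB and uses the fewest ports.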
