hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: How 'commodity' is 'commodity'
Date Tue, 29 Sep 2009 14:57:37 GMT
On Tue, Sep 29, 2009 at 6:09 AM, Xiance SI(司宪策) <adam.si@gmail.com> wrote:
> Virtualized nodes is a brilliant idea :) This greatly reduced the efforts,
> especially when the PCs are not fully in your control.
> Xiance
>
> On Tue, Sep 29, 2009 at 6:01 PM, Steve Loughran <stevel@apache.org> wrote:
>
>>
>> "commodity" really means x86 parts, non-RAID storage, no
>> infiniband-connected storage array, no esoteric OS -just Linux- and
>> commodity gigabit ether, nothing fancy like 10GBE except on a heavy-utilised
>> backbone :) With those kind of configurations, you reduce your capital
>> costs, leaving you more money to spend on the electricity bill. I'd still go
>> for RAID and/or NFS-mounted  RAID for bits of the namenode/2ary namenode if
>> you care about the data.
>>
>> Taeho Kang wrote:
>>
>>> If your "commodity" pc's don't have a whole lot of storage space, then you
>>> would have to run your HDFS datanodes elsewhere. In that case, a lot of
>>> data
>>> traffic will occur (e.g. sending data from datanodes to where data
>>> processing occurs), meaning map reduce performance will be slowed down.
>>> It's
>>> always good to have the actual data on the same machine where the
>>> processing
>>> will occur, or there will be extra network i/o involved.
>>>
>>> If you decide to host datanodes on pc's, then you also have to be able to
>>> protect the data. (e.g. make sure people don't accidentally delete data
>>> blocks.)
>>>
>>> Well, there are lots and lots of possibilities, and I would like to hear
>>> how
>>> your plan goes, too!
>>>
>>
>> I would go for storing data off the desktop machines, and just using them
>> as compute nodes -tasktrackers. This reduces the impact of them going
>> offline without warning but lets them do useful work. This will bump up
>> their bandwidth needs though.
>>
>> This still leaves you with the problem of configuring the hadoop cluster
>> for all these machines, especially if they are different. To work around
>> that, why not creating a VirtualBox or VMWare OS image containing the hadoop
>> binaries and configuration files. Everyone who runs the OS image joins the
>> cluster, but as soon as they pause it, that tasktracker goes away.
>>
>> When run Virtualized, HDD and network IO is slower, but if you are only
>> connecting to network storage, that network throttling could be useful, it
>> will cut back on LAN bandwidth. CPU performance can often be comparable, so
>> if your code is CPU-intensive, this can work
>>
>

In hadoop terms commodity means "Not super computer". If you look
around most large deployments have DataNodes with duel quad core
processors 8+GB ram and numerous disks, that is hardly the PC you find
under your desk.

For example, we had a dev clusters is a very modest setup, 5 HP DL
145. Duel Core 4 GB RAM, 2 SATA DISKS.

I did not do any elaborate testing, but i found that:

!!!!One!!!! DL180G5 (2x Quad Code) (8GB ram) 8 SATA disk crushed the 5
node cluster. So it is possible but it might be more administrative
work then it is worth.

Mime
View raw message