hadoop-common-user mailing list archives

From Xiance SI(司宪策) <adam...@gmail.com>
Subject Re: How 'commodity' is 'commodity'
Date Tue, 29 Sep 2009 10:09:21 GMT
Virtualized nodes are a brilliant idea :) This greatly reduces the effort,
especially when the PCs are not fully under your control.
Xiance

On Tue, Sep 29, 2009 at 6:01 PM, Steve Loughran <stevel@apache.org> wrote:

>
> "commodity" really means x86 parts, non-RAID storage, no
> infiniband-connected storage array, no esoteric OS (just Linux), and
> commodity gigabit ethernet, nothing fancy like 10GbE except on a heavily
> utilised backbone :) With those kinds of configurations, you reduce your
> capital costs, leaving you more money to spend on the electricity bill. I'd
> still go for RAID and/or NFS-mounted RAID for bits of the namenode/secondary
> namenode if you care about the data.
>
> Taeho Kang wrote:
>
>> If your "commodity" PCs don't have a whole lot of storage space, then you
>> would have to run your HDFS datanodes elsewhere. In that case, a lot of data
>> traffic will occur (e.g. sending data from the datanodes to where the
>> processing occurs), meaning MapReduce performance will be slowed down. It's
>> always good to have the actual data on the same machine where the processing
>> will occur, or there will be extra network I/O involved.
>>
>> If you decide to host datanodes on PCs, then you also have to be able to
>> protect the data (e.g. make sure people don't accidentally delete data
>> blocks).
>>
>> Well, there are lots and lots of possibilities, and I would like to hear how
>> your plan goes, too!
>>
>
> I would go for storing data off the desktop machines, and just using them
> as compute nodes (tasktrackers). This reduces the impact of them going
> offline without warning but lets them do useful work. It will bump up
> their bandwidth needs, though.
>
> This still leaves you with the problem of configuring the Hadoop cluster
> for all these machines, especially if they are different. To work around
> that, why not create a VirtualBox or VMware OS image containing the Hadoop
> binaries and configuration files? Everyone who runs the OS image joins the
> cluster, but as soon as they pause it, that tasktracker goes away.
>
> When run virtualized, HDD and network I/O are slower, but if you are only
> connecting to network storage, that network throttling could be useful: it
> will cut back on LAN bandwidth. CPU performance can often be comparable, so
> if your code is CPU-intensive, this can work.
>
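
For anyone who wants to try the VM-image approach quoted above, here is a minimal sketch (not something posted in the thread) of generating the configuration files that would be baked into the image. The property names fs.default.name, mapred.job.tracker and dfs.name.dir are the ones used by Hadoop releases of this period; the host names, ports and directory paths are made-up placeholders, and which *-site.xml file each property belongs in depends on your Hadoop version.

```python
import xml.etree.ElementTree as ET


def render_hadoop_conf(properties):
    """Render a dict of property names/values as a Hadoop-style
    <configuration> XML document and return it as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Compute-only desktop VM: it only needs to know where the filesystem and
# the jobtracker live. Host names and ports below are hypothetical.
compute_node_conf = {
    "fs.default.name": "hdfs://namenode.example.org:8020",
    "mapred.job.tracker": "jobtracker.example.org:8021",
}

# Namenode: keep the name table in more than one place, for example a local
# RAID volume plus an NFS mount. dfs.name.dir takes a comma-separated list;
# the paths below are hypothetical.
namenode_conf = {
    "dfs.name.dir": "/data/raid/dfs/name,/mnt/nfs/dfs/name",
}

if __name__ == "__main__":
    print(render_hadoop_conf(compute_node_conf))
    print(render_hadoop_conf(namenode_conf))
```

On the image itself you would then start only the tasktracker daemon (for example with bin/hadoop-daemon.sh start tasktracker) and no datanode, so that pausing the VM removes a compute slot without taking any HDFS blocks offline.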
