We at Yahoo are about to deploy code to ensure that a disk failure on a datanode is just that:
a disk failure, not a node failure. This really helps avoid replication storms.
It's in the 0.20.204 branch, for the curious.
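To make the "disk failure, not node failure" point concrete, here is a minimal sketch of how much data must be re-replicated when one disk dies, depending on whether the datanode tolerates the failure or shuts down. The function name, the 12 x 2 TB disk layout and the `tolerated` threshold are illustrative assumptions, not the code Yahoo is deploying; in later Hadoop releases a per-node threshold of this kind is exposed as `dfs.datanode.failed.volumes.tolerated`, but treat the mapping to 0.20.204 as an assumption.

```python
TB = 10**12  # decimal terabytes, as disk vendors quote them


def bytes_to_rereplicate(disk_bytes: list[int], failed: int, tolerated: int) -> int:
    """Bytes HDFS must copy after `failed` disks on one datanode die.

    Hypothetical model: if the number of failed volumes is within `tolerated`,
    the node stays up and only replicas on the failed disks are lost; otherwise
    the whole node is declared dead and every replica it held must be copied.
    Assumes the first `failed` entries of `disk_bytes` are the failed disks.
    """
    if failed <= tolerated:
        return sum(disk_bytes[:failed])
    return sum(disk_bytes)


disks = [2 * TB] * 12  # hypothetical 24 TB server: 12 disks of 2 TB each
print(bytes_to_rereplicate(disks, failed=1, tolerated=1) // TB)  # disk failure only
print(bytes_to_rereplicate(disks, failed=1, tolerated=0) // TB)  # whole node lost
```

With one failed disk, tolerating it means re-copying 2 TB instead of the node's full 24 TB, which is the difference between a routine repair and a replication storm.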
Arun
On Jun 28, 2011, at 3:01 AM, "Steve Loughran" <stevel@apache.org> wrote:
> On 28/06/11 04:49, Segel, Mike wrote:
>> Hmmm. I could have sworn there was a background balancing bandwidth limiter.
>
> There is, for the rebalancer; node outages are taken more seriously,
> though there have been problems in earlier 0.20.x releases where there was
> a risk of a cascade failure on a big switch/rack failure. The risk has been
> reduced, though we all await field reports to confirm this :)
>
> You can get 12-24 TB in a server today, which means the loss of a server
> generates a lot of traffic, which argues for 10 GbE.
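For a rough feel for the scale, the sketch below computes how long it takes to push a server's worth of data through a single link at line rate. The function name and the `efficiency` factor are illustrative assumptions. Real re-replication is spread across many source and target nodes, so this is a per-link bound, not a prediction of recovery time.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours to move `data_tb` decimal terabytes over one `link_gbps` link.

    `efficiency` (0 < efficiency <= 1) is an assumed fraction of line rate
    actually achieved; 1.0 means ideal line rate with no protocol overhead.
    """
    bits = data_tb * 10**12 * 8
    seconds = bits / (link_gbps * 10**9 * efficiency)
    return seconds / 3600


for tb in (12, 24):
    for gbps in (1, 10):
        print(f"{tb} TB over {gbps} Gb/s: {transfer_hours(tb, gbps):.1f} h")
```

At ideal line rate, 12 TB takes about 26.7 hours over 1 Gb/s and about 2.7 hours over 10 Gb/s; the ratio, not the absolute figure, is the point.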
>
> But
> -big increase in switch cost, especially if you (CoI warning) go with
> Cisco
> -there have been problems with things like BIOS PXE and lights-out
> management on 10 GbE, probably because the NICs are off the mainboard
> and not something the BIOS was expecting. This should improve.
> -I don't know how well Linux copes with Ethernet that fast (field reports
> useful)
> -the big threat is still ToR switch failure, as that will trigger a
> re-replication of every block in the rack.
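The ToR failure case can be sized the same way: the sketch below counts the block replicas stored in one rack, i.e. roughly how many block copies the namenode would have to schedule if the whole rack dropped off the network. The rack size and per-node capacity are hypothetical; 64 MB is the 0.20 default block size, and the count assumes every block is full-size (partial final blocks only raise it).

```python
def rack_block_replicas(nodes_per_rack: int, tb_per_node: float, block_mb: int = 64) -> int:
    """Approximate number of block replicas held in one rack.

    Capacity is in decimal TB (10**12 bytes); block size is in binary MB
    (2**20 bytes), matching how Hadoop block sizes are configured. Assumes
    the rack is full and every block is exactly `block_mb` in size.
    """
    rack_bytes = nodes_per_rack * tb_per_node * 10**12
    return int(rack_bytes // (block_mb * 2**20))


# Hypothetical rack: 20 servers with 12 TB each, 64 MB blocks.
print(rack_block_replicas(20, 12))
```

That is on the order of three and a half million block copies from a single switch failure, which is why the ToR switch is the big threat.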
>
> 2x1 GbE lets you have redundant switches, albeit at the price of more
> wiring, more things to go wrong with the wiring, etc.
>
> The other thing to consider is how well the "enterprise" switches work
> in this world: with a Hadoop cluster you can really test the vendors' claims
> about how well the switches handle every port lighting up at full rate.
> Indeed, I recommend that as part of your acceptance tests for the switch.