hadoop-mapreduce-issues mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1800) using map output fetch failures to blacklist nodes is problematic
Date Wed, 19 May 2010 18:41:57 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869258#action_12869258 ]

Joydeep Sen Sarma commented on MAPREDUCE-1800:
----------------------------------------------

If there is a total network partition, then we don't have a problem: either the cluster will
fail outright (say the JT and NN land on different sides of the partition), or one partition
(the one that has the JT/NN) will exclude nodes from the other. (I say we don't have a problem
in the sense that Hadoop's response to such an event is more or less correct.)

The problem is that we have had occurrences of slow networks that are not quite partitioned.
For example, the uplink from one rack switch to the core switch can be flaky/degraded. In this
case, control traffic from the JT to the TTs may be getting through, but data traffic from
mappers and reducers on the degraded racks can be badly hurt. If there are problems in the
core switch itself (it's underprovisioned), then the whole cluster has network problems.
The description applies to such scenarios.

In such a case, the appropriate response of the software should be, at worst, degraded performance
(in keeping with the degraded nature of the underlying hardware) or, at best, correctly identifying
the slow node(s) and not using them, or using them less (this would apply to the flaky
rack-uplink scenario). The current response of Hadoop is neither. It makes a bad situation
worse by misassigning blame (map nodes on good racks get blamed by a sufficiently large
number of reducers running on bad racks). We potentially lose nodes from good racks, and the
resulting retry of tasks puts further stress on the strained network resource.
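
To make the blame-misassignment concrete, here is a simplified sketch (not the actual JobTracker
code, and the names and threshold are illustrative) of the kind of fetch-failure accounting that
drives blacklisting today: each reducer that fails to pull a map output effectively "votes" against
the mapper's host, and once enough distinct reducers have voted, the host is faulted and the map
is re-run, regardless of whether the mapper or the reducers' own rack uplink is the real problem.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class FetchFailureTracker {
    // map host -> set of reducer task ids that reported a fetch failure against it
    private final Map<String, Set<String>> failureReports = new HashMap<>();
    private final int faultThreshold; // illustrative: some fraction of running reducers

    FetchFailureTracker(int faultThreshold) {
        this.faultThreshold = faultThreshold;
    }

    /** Returns true when the map host would be considered faulty under this scheme. */
    boolean reportFetchFailure(String mapHost, String reducerId) {
        failureReports
            .computeIfAbsent(mapHost, h -> new HashSet<>())
            .add(reducerId);
        // Note: nothing here asks whether the *reducers* happen to share a degraded
        // rack uplink, which is exactly how blame gets misassigned to a healthy mapper.
        return failureReports.get(mapHost).size() >= faultThreshold;
    }
}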

A couple of things seem desirable:
1. For enterprise data center environments that (may) have a high degree of control and monitoring
around their networking elements: the ability to selectively turn off the functionality
in Hadoop that tries to detect and correct for network problems. Dedicated diagnostics
stand a much better chance of catching/identifying networking problems and fixing them.
2. In environments with less control (say Amazon EC2, or Hadoop running on a bunch of PCs across
a company) that are more akin to a p2p network, Hadoop's network fault diagnosis algorithms
need improvement. A comparison to BitTorrent is fair: there, every node advertises its
upload/download throughput, and a node can come across as slow only in comparison to the collective
stats published by all peers (not just based on communication with a small set of other
peers); see the sketch after this list.
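
As a rough illustration of point 2, here is a minimal sketch (hypothetical, not an existing Hadoop
API; the class, parameter, and threshold are made up for illustration) of the BitTorrent-style
idea: every node publishes its observed shuffle throughput, and a node is flagged as slow only
relative to the collective statistics of all peers, not because a handful of reducers on a
degraded rack happened to time out against it.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class PeerRelativeSlownessDetector {
    private final double slowFraction; // e.g. 0.25 => flag nodes below 25% of the cluster median

    PeerRelativeSlownessDetector(double slowFraction) {
        this.slowFraction = slowFraction;
    }

    /**
     * @param reportedThroughput node -> self-reported bytes/sec served to peers
     * @return hosts whose throughput is far below the cluster-wide median
     */
    List<String> findSlowNodes(Map<String, Double> reportedThroughput) {
        List<Double> all = new ArrayList<>(reportedThroughput.values());
        if (all.isEmpty()) {
            return Collections.emptyList();
        }
        Collections.sort(all);
        double median = all.get(all.size() / 2);
        List<String> slow = new ArrayList<>();
        for (Map.Entry<String, Double> e : reportedThroughput.entrySet()) {
            // A node is only "slow" in comparison to what the whole cluster reports.
            if (e.getValue() < slowFraction * median) {
                slow.add(e.getKey());
            }
        }
        return slow;
    }
}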



> using map output fetch failures to blacklist nodes is problematic
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1800
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1800
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Joydeep Sen Sarma
>
> If a mapper and a reducer cannot communicate, then either party could be at fault. The
> current Hadoop protocol allows reducers to declare nodes running the mapper as being at fault.
> When a sufficient number of reducers do so, the map node can be blacklisted.
> In cases where networking problems cause substantial degradation in communication across
> sets of nodes, a large number of nodes can become blacklisted as a result of this protocol.
> The blacklisting is often wrong (reducers on the smaller side of the network partition can
> collectively cause nodes on the larger side of the partition to be blacklisted) and
> counterproductive (rerunning maps puts further load on the already maxed-out network links).
> We should revisit how we can better identify nodes with genuine network problems (and
> what role, if any, map-output fetch failures have in this).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

