hadoop-mapreduce-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1800) using map output fetch failures to blacklist nodes is problematic
Date Wed, 19 May 2010 18:08:57 GMT
using map output fetch failures to blacklist nodes is problematic

                 Key: MAPREDUCE-1800
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1800
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Joydeep Sen Sarma

If a mapper and a reducer cannot communicate, then either party could be at fault. The current
Hadoop protocol allows reducers to declare the node running the mapper as being at fault. When
a sufficient number of reducers do so, the map node can be blacklisted.
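A rough sketch of that reporting logic is below (the class name, threshold, and method names are illustrative, not the actual JobTracker code): each reducer that fails to fetch a map output reports a failure against the host that ran the map, and once enough distinct reducers have reported, the host is blacklisted.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FetchFailureTracker {
    // Hypothetical threshold: blacklist once this many distinct reducers
    // have reported failures against the same map host.
    private static final int FAILURE_THRESHOLD = 3;

    // map host -> set of reducer task ids that reported a failed fetch
    private final Map<String, Set<String>> failuresByHost = new HashMap<>();
    private final Set<String> blacklistedHosts = new HashSet<>();

    /** Called when a reducer reports that it could not fetch map output from mapHost. */
    public synchronized void reportFetchFailure(String reducerId, String mapHost) {
        failuresByHost
            .computeIfAbsent(mapHost, h -> new HashSet<>())
            .add(reducerId);
        if (failuresByHost.get(mapHost).size() >= FAILURE_THRESHOLD) {
            // The host running the mapper is assumed to be at fault,
            // even though the reducers' side of the network could be the real problem.
            blacklistedHosts.add(mapHost);
        }
    }

    public synchronized boolean isBlacklisted(String mapHost) {
        return blacklistedHosts.contains(mapHost);
    }
}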

In cases where networking problems cause substantial degradation in communication across sets
of nodes, a large number of nodes can become blacklisted as a result of this protocol.
The blacklisting is often wrong (reducers on the smaller side of a network partition can
collectively cause nodes on the larger side of the partition to be blacklisted) and counterproductive
(rerunning maps puts further load on the already saturated network links).
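Continuing the hypothetical sketch above, a small worked example of how a handful of reducers on the minority side of a partition can blacklist every map host on the healthy majority side (the node and task names are made up):

public class PartitionExample {
    public static void main(String[] args) {
        FetchFailureTracker tracker = new FetchFailureTracker();
        String[] isolatedReducers = {"reduce_1", "reduce_2", "reduce_3"};
        String[] mapHostsOnLargeSide = {"node10", "node11", "node12", "node13"};

        // Each isolated reducer fails to fetch from every map host across the partition.
        for (String reducer : isolatedReducers) {
            for (String host : mapHostsOnLargeSide) {
                tracker.reportFetchFailure(reducer, host);
            }
        }

        // All hosts on the larger (healthy) side are now blacklisted.
        for (String host : mapHostsOnLargeSide) {
            System.out.println(host + " blacklisted: " + tracker.isBlacklisted(host));
        }
    }
}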

We should revisit how we can better identify nodes with genuine network problems (and what
role, if any, map-output fetch failures have in this).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
