Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <15376218.20471274295476060.JavaMail.jira@thor>
Date: Wed, 19 May 2010 14:57:56 -0400 (EDT)
From: "Todd Lipcon (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-1800) using map output fetch failures
 to blacklist nodes is problematic
In-Reply-To: <26464038.18861274292537891.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869266#action_12869266 ] 

Todd Lipcon commented on MAPREDUCE-1800:
----------------------------------------

Hey Joydeep. Thanks for the further explanation; I agree we could do better here. There's an old JIRA where we threw around some similar ideas, maybe last August or so, but I can't seem to find it at the moment. Does anyone remember the one I mean?

> using map output fetch failures to blacklist nodes is problematic
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1800
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1800
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Joydeep Sen Sarma
>
> If a mapper and a reducer cannot communicate, either party could be at fault. The current Hadoop protocol allows reducers to declare the nodes running the mapper to be at fault. When a sufficient number of reducers do so, the map node can be blacklisted.
> In cases where networking problems substantially degrade communication across sets of nodes, a large number of nodes can become blacklisted as a result of this protocol. The blacklisting is often wrong (reducers on the smaller side of a network partition can collectively cause nodes on the larger side to be blacklisted) and counterproductive (rerunning maps puts further load on the already maxed-out network links).
> We should revisit how we can better identify nodes with genuine network problems (and what role, if any, map-output fetch failures have in this).
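
To make the failure mode in the description concrete, here is a minimal toy simulation. It is only a sketch under stated assumptions, not Hadoop's actual blacklisting code: the function `blacklisted_nodes`, the fixed `threshold`, the node names, and the 8-versus-3 partition are all hypothetical and chosen for illustration. The one property it takes from the description is that a failed fetch is always blamed on the map side.

```python
from collections import defaultdict


def blacklisted_nodes(map_nodes, reducer_nodes, can_reach, threshold):
    """Return the map nodes reported by at least `threshold` reducers.

    A reducer reports a map node whenever it cannot fetch from it.
    The reducer is never considered at fault, which is the one-sided
    blame described in the issue. (Hypothetical rule, not Hadoop's code.)
    """
    reports = defaultdict(int)
    for reducer in reducer_nodes:
        for mapper in map_nodes:
            if not can_reach(reducer, mapper):
                reports[mapper] += 1
    return {node for node, count in reports.items() if count >= threshold}


# Hypothetical cluster split by a network partition: 8 nodes on the
# larger side, 3 on the smaller side. Every node runs maps and reducers.
large = {f"big{i}" for i in range(8)}
small = {f"small{i}" for i in range(3)}
nodes = large | small


def can_reach(a, b):
    # Nodes can only talk to nodes on their own side of the partition.
    return (a in large) == (b in large)


# With a threshold of 3 reports, the 3 reducers on the smaller side are
# enough to blacklist all 8 healthy nodes on the larger side.
result = blacklisted_nodes(nodes, nodes, can_reach, threshold=3)
print(sorted(result))
```

In this toy setup every node in the cluster ends up blacklisted when `threshold=3`, while raising the threshold to 4 blacklists only the three smaller-side nodes. A real cluster is messier (partial degradation rather than a clean split, retries, timeouts), so this only illustrates why counting reducer complaints alone cannot tell which side of a broken link is at fault.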

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.