hadoop-mapreduce-issues mailing list archives

From "Ramkumar Vadali (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1800) using map output fetch failures to blacklist nodes is problematic
Date Thu, 20 May 2010 19:21:18 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869734#action_12869734 ]

Ramkumar Vadali commented on MAPREDUCE-1800:
--------------------------------------------

From my understanding, a map output fetch is an HTTP GET. I agree that TCP-level network
errors are poor indicators for blacklisting, since they make it impossible to distinguish
a server-side error from a network error. But HTTP-level errors, especially HTTP 5xx
errors (which indicate server-side failures), should be used for blacklisting. Disk failures
that prevent the HTTP server from reading a file fall into this category. Such errors could
also surface as map failures, but an HTTP 5xx error is a fairly reliable indicator of a
problem on the serving host.
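To make the distinction concrete, here is a minimal Java sketch (not Hadoop's actual shuffle
fetcher; the URL and class names are hypothetical) of how a reducer could classify a failed
fetch: an HTTP 5xx response means the server answered, so the network path worked and the
fault is likely server-side, while a connect/read exception is ambiguous.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch: classify a map-output fetch failure as
// TCP-level (ambiguous) vs. HTTP 5xx (server-side).
public class FetchErrorClassifier {

    enum FailureKind { NONE, NETWORK_AMBIGUOUS, SERVER_SIDE }

    static FailureKind classifyFetch(String mapOutputUrl) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(mapOutputUrl).openConnection();
            conn.setConnectTimeout(30_000);
            conn.setReadTimeout(30_000);
            int status = conn.getResponseCode();
            if (status >= 500) {
                // The server responded, so the network path worked; the
                // failure (e.g. a bad disk behind the servlet) is on the host.
                return FailureKind.SERVER_SIDE;
            }
            return FailureKind.NONE;
        } catch (IOException e) {
            // Connect/read failures could be the server OR the network --
            // too ambiguous to blame the map node, per the argument above.
            return FailureKind.NETWORK_AMBIGUOUS;
        } finally {
            if (conn != null) conn.disconnect();
        }
    }
}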

So in my opinion we should retain the blacklisting, but make it smarter by using HTTP-level
error information.
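As a hedged illustration of that suggestion (this is not the actual JobTracker logic; the
threshold value and names are made up), the sketch below counts only 5xx-backed fetch-failure
reports toward a per-host blacklist threshold and ignores ambiguous TCP-level errors entirely.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical policy sketch: only fetch failures backed by an HTTP 5xx
// response count toward blacklisting. Threshold and names are illustrative.
public class ServerErrorBlacklistPolicy {
    private static final int BLACKLIST_THRESHOLD = 4; // made-up value

    private final Map<String, AtomicInteger> fiveXxReportsByHost =
            new ConcurrentHashMap<>();

    // Records one fetch-failure report against a map host. sawHttp5xx is
    // true if the fetch got an HTTP 5xx response, false for connect/read
    // (TCP-level) failures. Returns true once the host crosses the threshold.
    public boolean reportFetchFailure(String host, boolean sawHttp5xx) {
        if (!sawHttp5xx) {
            return false; // network-ambiguous errors never advance the count
        }
        int count = fiveXxReportsByHost
                .computeIfAbsent(host, h -> new AtomicInteger())
                .incrementAndGet();
        return count >= BLACKLIST_THRESHOLD;
    }
}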

> using map output fetch failures to blacklist nodes is problematic
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1800
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1800
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Joydeep Sen Sarma
>
> If a mapper and a reducer cannot communicate, then either party could be at fault. The
current Hadoop protocol allows reducers to declare the node running the mapper to be at fault.
When a sufficient number of reducers do so, the map node can be blacklisted.
> In cases where networking problems cause substantial degradation in communication across
sets of nodes, a large number of nodes can become blacklisted as a result of this protocol.
The blacklisting is often wrong (reducers on the smaller side of a network partition can
collectively cause nodes on the larger partition to be blacklisted) and counterproductive
(rerunning maps puts further load on the already maxed-out network links).
> We should revisit how we can better identify nodes with genuine network problems (and
what role, if any, map-output fetch failures have in this).
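
The partition failure mode described in the quoted issue above can be shown with a small
made-up simulation (all numbers and the threshold are illustrative, not Hadoop defaults):
every reducer stranded on the small side of the partition reports a fetch failure against
every mapper on the large side, so every far-side mapper crosses the threshold at once.

// Hypothetical demo of the failure mode: a partition strands a few
// reducers, and their collective reports blacklist the whole large side.
public class PartitionBlacklistDemo {
    public static void main(String[] args) {
        int reducersSmallSide = 10;  // reducers stranded on the small partition
        int mappersLargeSide = 200;  // healthy map nodes on the large partition
        int threshold = 4;           // illustrative fetch-failure threshold

        // Each stranded reducer fails to fetch from every far-side mapper,
        // so every such mapper accumulates one report per stranded reducer.
        int reportsPerMapper = reducersSmallSide;

        int blacklisted = reportsPerMapper >= threshold ? mappersLargeSide : 0;
        System.out.printf(
            "Each of %d mappers receives %d failure reports (threshold %d) -> %d nodes blacklisted%n",
            mappersLargeSide, reportsPerMapper, threshold, blacklisted);
    }
}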

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

