hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6303) Read timeout when retrying a fetch error can be fatal to a reducer
Date Wed, 01 Apr 2015 21:37:53 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391516#comment-14391516
] 

Jason Lowe commented on MAPREDUCE-6303:
---------------------------------------

Sample reduce log snippet showing the issue:

{noformat}
2015-03-28 00:31:54,393 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running
child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in
fetcher#7
	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:150)
	at java.net.SocketInputStream.read(SocketInputStream.java:121)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:633)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:579)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1322)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:427)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:392)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:338)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)

2015-03-28 00:31:54,511 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the
task
{noformat}

The problem is that the code caught an IOException trying to shuffle and within the catch
block the code throws _again_ which leaks up to the top of the Fetcher thread and kills the
task.

> Read timeout when retrying a fetch error can be fatal to a reducer
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6303
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6303
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> If a reducer encounters an error trying to fetch from a node then encounters a read timeout
when trying to re-establish the connection then the reducer can fail.  The read timeout exception
can leak to the top of the Fetcher thread which will cause the reduce task to teardown.  This
type of error can repeat across reducer attempts causing jobs to fail due to a single bad
node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message