From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour
Date Tue, 12 Aug 2014 07:19:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093835#comment-14093835 ]

Zhijie Shen commented on MAPREDUCE-6024:
----------------------------------------

bq. 1. For MAX_FETCH_FAILURES_NOTIFICATIONS, if it is changed to be proportional to the number of reducers, it will be the same as MAX_ALLOWED_FETCH_FAILURES_FRACTION, so I deleted it. I do believe ...

Sounds good to me. Under the existing defaults, the only case where a failure would have been triggered before the patch but not after it is fetchFailures <= 2 and shufflingReduceTasks <= 3. Given the problem described in this jira, it makes sense to give fewer chances when there are fewer reducer tasks. And if users really want to give the fetcher enough chances, they can tune MAX_ALLOWED_FETCH_FAILURES_FRACTION, even making it go beyond 1.0.
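
To make that concrete, here is a minimal sketch of a fraction-based check (method and variable names are illustrative, not taken from the patch):
{code}
// Sketch only: a fraction-based threshold scales the allowed fetch failures with the
// number of shuffling reducers, e.g. a fraction of 0.5 and 3 reducers tolerate only
// 1 failure before the failure is reported.
static boolean shouldReportFetchFailure(int fetchFailures, int shufflingReduceTasks,
    float maxAllowedFetchFailuresFraction) {
  int allowedFailures =
      Math.max(1, (int) (shufflingReduceTasks * maxAllowedFetchFailuresFraction));
  return fetchFailures > allowedFailures;
}
{code}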

bq. 4. Sometimes the fetcher can get the data successfully after retrying from a SocketTimeoutException, so I think letting the fetcher retry a few times is OK.

Sounds reasonable. In addition, I linked back to the previous comments in [MAPREDUCE-4772|https://issues.apache.org/jira/browse/MAPREDUCE-4772?focusedCommentId=13492593&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13492593], which noted that a connect exception is more severe than a timeout.
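
To illustrate that distinction (a sketch for discussion, not the actual Fetcher logic), a connect failure could be reported immediately while a read timeout gets a bounded number of retries:
{code}
// Sketch only: treat connect failures as more severe than read timeouts.
static boolean shouldRetry(java.io.IOException e, int attempts, int maxRetries) {
  if (e instanceof java.net.ConnectException) {
    return false;                  // host unreachable: report the failure right away
  }
  if (e instanceof java.net.SocketTimeoutException) {
    return attempts < maxRetries;  // transient slowness: give the fetcher a few more chances
  }
  return false;
}
{code}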

[~venkateshrin], do you have any further comments?

Some more comments:

1. maxfetchfailuresfraction -> max-fetch-failures-fraction? and maxhostfailures -> max-host-failures?
{code}
+  public static final String MAX_ALLOWED_FETCH_FAILURES_FRACTION = "mapreduce.reduce.shuffle.maxfetchfailuresfraction";
{code}
{code}
+  public static final String MAX_SHUFFLE_FETCH_HOST_FAILURES = "mapreduce.reduce.shuffle.maxhostfailures";
{code}
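For example, the suggested renaming would look like this (just a sketch of the naming, not part of the patch):
{code}
public static final String MAX_ALLOWED_FETCH_FAILURES_FRACTION = "mapreduce.reduce.shuffle.max-fetch-failures-fraction";
public static final String MAX_SHUFFLE_FETCH_HOST_FAILURES = "mapreduce.reduce.shuffle.max-host-failures";
{code}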

2. Is it necessary to multiply the failures by numMaps? copyFailed is in a loop and invoked
for each remaining/failed task, right?
{code}
+    //report failure if already retried maxHostFailures times
+    boolean hostFail = hostFailures.get(hostname).get() > this.maxHostFailures
+        * numMaps ? true : false;
{code}
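
If the multiplication by numMaps turns out to be unnecessary, the check could presumably be simplified to a direct comparison (again just a sketch for discussion):
{code}
// report failure once this host has already been retried maxHostFailures times
boolean hostFail = hostFailures.get(hostname).get() > this.maxHostFailures;
{code}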

BTW, you may want to click "Submit Patch" to ask Jenkins to verify your patch.

> java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour
> ---------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6024
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, task
>            Reporter: zhaoyunjiong
>            Assignee: zhaoyunjiong
>            Priority: Critical
>         Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.patch
>
>
> 2014-08-04 21:09:42,356 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 2 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> 2014-08-04 21:09:42,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: fake.host.name:13562 freed by fetcher#33 in 180024ms
> 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 21:09:55,360 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 to fake.host.name:13562 to fetcher#33
> 2014-08-04 21:12:55,463 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> ...
> 2014-08-04 22:03:13,416 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: fake.host.name:13562 freed by fetcher#33 in 271081ms
> 2014-08-04 22:04:13,417 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 22:04:13,417 INFO fetcher#33 org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 to fake.host.name:13562 to fetcher#33
> 2014-08-04 22:07:13,449 WARN fetcher#33 org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)



