hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3130) Shuffling takes too long to get the last map output.
Date Fri, 04 Apr 2008 14:25:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585546#action_12585546 ]

Runping Qi commented on HADOOP-3130:
------------------------------------



A lot of reducers failed with the following message:

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error:
Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error:
Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
bailing-out.
  

I see a lot of the following exceptions in the log:

2008-04-04 13:50:03,796 WARN org.apache.hadoop.mapred.ReduceTask: task_200804041304_0005_r_000000_2
copy failed: task_200804041304_0005_m_000181_0 from xxxx.com
2008-04-04 13:50:03,823 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException:
Read timed out
	at sun.reflect.GeneratedConstructorAccessor3.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1298)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1292)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:948)
	at org.apache.hadoop.mapred.MapOutputLocation.getInputStream(MapOutputLocation.java:125)
	at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:165)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:129)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:632)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:577)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1004)
	... 4 more

Did you also change the read timeout?
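
For reference, the timeout in the trace above is the read timeout on the HTTP connection the reducer uses to pull a map output. A minimal sketch of setting both timeouts on an HttpURLConnection (the constant values and method name here are illustrative, not the ones in MapOutputLocation):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class MapOutputFetchTimeouts {
    // Illustrative values only; the real defaults live in the reduce-side copier code.
    private static final int CONNECT_TIMEOUT_MS = 30 * 1000;
    private static final int READ_TIMEOUT_MS = 3 * 60 * 1000;

    static HttpURLConnection open(URL mapOutputUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) mapOutputUrl.openConnection();
        // A longer connect timeout alone does not cover the failure above: it is the
        // read of the HTTP response that timed out, so the read timeout has to be
        // raised (or the serving side made faster) as well.
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
        conn.setReadTimeout(READ_TIMEOUT_MS);
        return conn;
    }
}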

What is the value of MAX_FAILED_UNIQUE_FETCHES?
Should it be some percentage of the total number of maps?

Anyhow, we need to revisit the policy for failing a reducer during shuffling.
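
To make that concrete, one possible policy (the class name, constants, and values below are hypothetical, not what ReduceTask currently does) would derive the bail-out threshold from the number of maps instead of using a fixed constant:

public final class ShuffleFailurePolicy {
    private static final double FAILED_FETCH_FRACTION = 0.5; // illustrative fraction
    private static final int MIN_FAILED_FETCH_THRESHOLD = 5;  // illustrative floor for small jobs

    public static boolean shouldBailOut(int failedUniqueFetches, int totalMaps) {
        // Scale the tolerated number of failed unique fetches with the job size,
        // but never let the threshold drop below a small fixed floor.
        int threshold = Math.max(MIN_FAILED_FETCH_THRESHOLD,
                                 (int) (totalMaps * FAILED_FETCH_FRACTION));
        return failedUniqueFetches >= threshold;
    }
}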


> Shuffling takes too long to get the last map output.
> ----------------------------------------------------
>
>                 Key: HADOOP-3130
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3130
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>         Attachments: HADOOP-3130-v2.patch, HADOOP-3130.patch, shuffling.log
>
>
> I noticed that towards the end of shuffling, the map output fetcher of the reducer backs off too aggressively.
> I attach a fraction of one reduce log of my job.
> Note that the last map output was not fetched for 2 minutes.
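
To illustrate the back-off behavior described in the report above: with a doubling back-off (the names and numbers below are illustrative, not the actual ReduceTask logic), a handful of failed attempts on the same map output already pushes the next fetch out by minutes:

public final class FetchBackoffSketch {
    private static final long BASE_DELAY_MS = 4000L;  // illustrative base delay
    private static final long MAX_DELAY_MS = 300000L; // illustrative cap

    public static long nextDelayMs(int failuresForThisMap) {
        // Double the wait on every failed attempt for the same map output,
        // bounding the shift so the delay never exceeds the cap.
        long delay = BASE_DELAY_MS << Math.min(failuresForThisMap, 10);
        return Math.min(delay, MAX_DELAY_MS);
    }

    public static void main(String[] args) {
        // Prints 4s, 8s, 16s, 32s, 64s, 128s: by the fifth retry the reducer is
        // already waiting on the order of the two minutes seen in the attached log.
        for (int failures = 0; failures <= 5; failures++) {
            System.out.println("failures=" + failures + " -> wait "
                    + (nextDelayMs(failures) / 1000) + "s");
        }
    }
}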

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

