hadoop-user mailing list archives

From Seonyoung Park <render...@gmail.com>
Subject Hadoop error in shuffle in fetcher: Exceeded MAX_FAILED_UNIQUE_FETCHES
Date Wed, 07 Jun 2017 03:06:30 GMT
Hi all,

We run a Hadoop cluster (Apache Hadoop 2.7.1) with 40 DataNodes.
We're currently using the Fair Scheduler, and there is no limit on the
number of concurrently running jobs. Around 30-50 I/O-heavy jobs run
concurrently in the early morning.

Recently we have been getting shuffle errors like the following when
running the HDFS Balancer or Spark Streaming jobs:

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:354)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



I also noticed that a SocketTimeoutException occurred in some tasks of
the same job, but we have found no underlying network problem.
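
The only shuffle fetch timeouts we are aware of are the fetcher's
connect and read timeouts, which we have left at their defaults. A
minimal mapred-site.xml sketch of what raising them might look like,
assuming these two properties are the relevant ones here (the 300000 ms
values are illustrative; 180000 ms is the shipped default for both):

    <!-- mapred-site.xml: reducer-side shuffle fetch timeouts -->
    <property>
      <name>mapreduce.reduce.shuffle.connect.timeout</name>
      <value>300000</value> <!-- illustrative; default is 180000 ms -->
    </property>
    <property>
      <name>mapreduce.reduce.shuffle.read.timeout</name>
      <value>300000</value> <!-- illustrative; default is 180000 ms -->
    </property>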


Someone suggested increasing the value of the
"mapreduce.tasktracker.http.threads" property. However, as far as I can
tell, no code references that property after the commit starting with
hash 80a05764be5c4f517.
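
Our working assumption (please correct us if this is wrong) is that
under YARN the map output is served by the NodeManager's auxiliary
ShuffleHandler rather than by a TaskTracker, so the analogous knob
would be mapreduce.shuffle.max.threads. If so, we would try something
like this in mapred-site.xml on the NodeManagers (128 is just an
example value; our understanding is that the default of 0 means twice
the number of available processors):

    <!-- mapred-site.xml: worker threads for the NodeManager ShuffleHandler -->
    <property>
      <name>mapreduce.shuffle.max.threads</name>
      <value>128</value> <!-- example; default 0 = 2 * available cores -->
    </property>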


Here are my questions:

1. Is that property still used anywhere?
2. If so, is increasing it really likely to solve our problem?
3. Do we need to fine-tune the settings of the NodeManagers and
DataNodes? (A sketch of what we have in mind follows this list.)
4. Is there a better solution?
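
For question 3, the DataNode-side setting we have in mind is the block
transfer thread pool in hdfs-site.xml; a sketch of what we would try,
assuming dfs.datanode.max.transfer.threads is the right knob (8192 is
just an example; 4096 is the shipped default):

    <!-- hdfs-site.xml: max concurrent block transfer threads per DataNode -->
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>8192</value> <!-- example; default is 4096 -->
    </property>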


Thanks,
Pak
