hadoop-user mailing list archives

From Ravi Prakash <ravihad...@gmail.com>
Subject Re: Hadoop error in shuffle in fetcher: Exceeded MAX_FAILED_UNIQUE_FETCHES
Date Wed, 07 Jun 2017 18:12:21 GMT
Hi Seonyoung!

Please take a look at this file:
https://github.com/apache/hadoop/blob/branch-2.7.1/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java#L208

The ShuffleHandler is an auxiliary service that runs inside the NodeManager
and serves the intermediate map output data to reducers.
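For reference, here is a minimal sketch of how that auxiliary service is wired up in yarn-site.xml in Hadoop 2.x (property names are the standard ones; the thread-count value is illustrative, not a recommendation for your cluster):

```xml
<!-- yarn-site.xml: register the ShuffleHandler as a NodeManager auxiliary service -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- Optional: cap the worker threads the ShuffleHandler uses to serve
     map output. 0 (the default) means 2 * available processors. -->
<property>
  <name>mapreduce.shuffle.max.threads</name>
  <value>0</value>
</property>
```

In MR2 this `mapreduce.shuffle.max.threads` property (read by ShuffleHandler) took over the role that `mapreduce.tasktracker.http.threads` played in MR1, which may be why the latter no longer appears in the code.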

Cheers
Ravi

On Tue, Jun 6, 2017 at 8:06 PM, Seonyoung Park <renderaid@gmail.com> wrote:

> Hi all,
>
> We run a Hadoop cluster (Apache Hadoop 2.7.1) with 40 DataNodes.
> Currently, we're using the Fair Scheduler in our cluster,
> and there is no limit on the number of concurrently running jobs.
> 30 to 50 I/O-heavy jobs have been running concurrently at dawn.
>
> Recently we have been getting shuffle errors like the following when
> running the HDFS Balancer or Spark streaming jobs:
>
> Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
>     at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>     at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
>     at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:354)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
>
>
>
> I also noticed that a SocketTimeoutException had occurred in some tasks
> of the same job, but there is no apparent network problem.
>
>
> Someone said that we need to increase the value of the
> "mapreduce.tasktracker.http.threads" property.
> However, no code uses that property after the commit starting with hash
> 80a05764be5c4f517.
>
>
> Here are my questions:
>
> 1. Is that property currently being used?
> 2. If so, is it really helpful for solving our problem?
> 3. Do we need to fine tune the settings of NodeManagers and DataNodes?
> 4. Is there any better solution?
>
>
> Thanks,
> Pak
>
