hadoop-mapreduce-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6361) NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host
Date Tue, 12 May 2015 09:42:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539594#comment-14539594 ]

Junping Du commented on MAPREDUCE-6361:
---------------------------------------

There are basically two ways to fix the race condition here:
1. Extract the following code into a synchronized method, so that copySucceeded() is blocked
until copyFailed() has finished.
{code}
scheduler.hostFailed(host.getHostName());
for(TaskAttemptID left: failedTasks) {
    scheduler.copyFailed(left, host, true, false);
}
{code}
This is likely to have a larger performance impact on shuffle, because a failure to fetch map output on one
thread would block copySucceeded() on the other threads for longer.

2. Update copyFailed() to assume that the host's entry in hostFailures may have been cleaned up by another
thread. In that case, add the host back to hostFailures as if this were the first time the host failed.

I prefer the second option, which sounds more lightweight. I will deliver a quick patch soon.
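
A minimal sketch of the second option, for illustration only. HostFailureTracker is a hypothetical stand-in, not the real ShuffleSchedulerImpl: it keeps only a per-host failure map, uses plain Integer counters, and its method signatures are simplified assumptions. It shows copyFailed() tolerating an entry that copySucceeded() removed in between, instead of dereferencing null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the scheduler's per-host failure bookkeeping.
// Names and signatures are assumptions made for this sketch.
class HostFailureTracker {
    private final Map<String, Integer> hostFailures = new HashMap<>();
    private final int maxHostFailures;

    HostFailureTracker(int maxHostFailures) {
        this.maxHostFailures = maxHostFailures;
    }

    // Record one more failure for the host.
    synchronized void hostFailed(String hostname) {
        hostFailures.merge(hostname, 1, Integer::sum);
    }

    // A successful copy clears the host's failure history.
    synchronized void copySucceeded(String hostname) {
        hostFailures.remove(hostname);
    }

    // Returns true when the host has exceeded the failure limit.
    synchronized boolean copyFailed(String hostname) {
        Integer failures = hostFailures.get(hostname);
        if (failures == null) {
            // The entry was removed by copySucceeded() on another thread
            // after hostFailed() ran. Add it back as a first failure
            // rather than hitting a NullPointerException.
            failures = 1;
            hostFailures.put(hostname, failures);
        }
        return failures > maxHostFailures;
    }

    // Current failure count for the host, 0 if none is recorded.
    synchronized int failureCount(String hostname) {
        return hostFailures.getOrDefault(hostname, 0);
    }

    public static void main(String[] args) {
        HostFailureTracker tracker = new HostFailureTracker(3);
        tracker.hostFailed("host-a");      // fetcher thread 1 reports a failure
        tracker.copySucceeded("host-a");   // fetcher thread 2 succeeds in between
        boolean hostFail = tracker.copyFailed("host-a"); // thread 1 continues
        System.out.println("hostFail=" + hostFail
                + ", failures=" + tracker.failureCount("host-a"));
    }
}
```

Each method still takes the lock individually, so a slow failure path on one thread does not hold up copySucceeded() on the others across the whole hostFailed()/copyFailed() sequence, which is the cost the first option would incur.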

> NPE issue in shuffle caused by concurrent issue between copySucceeded() in one thread and copyFailed() in another thread on the same host
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6361
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6361
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> The failure in log:
> 2015-05-08 21:00:00,513 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#25
>          at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>          at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:415)
>          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>          at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>          at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:267)
>          at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:308)
>          at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
