hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5924) Windows: Sort Job failed due to 'Invalid event: TA_COMMIT_PENDING at COMMIT_PENDING'
Date Fri, 13 Jun 2014 05:15:02 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030268#comment-14030268
] 

Zhijie Shen commented on MAPREDUCE-5924:
----------------------------------------

Some investigation based our logs:

TA_COMMIT_PENDING at COMMIT_PENDING happens because there're two TA_COMMIT_PENDING that have
been sent to TaskAttemptListenerImpl. The first one causes the state transition from RUNNING
to COMMIT_PENDING, and the second one causes the invalid transition.

Here's the code that notifies AM of committing the pending task.
{code}
// say the task tracker that task is commit pending
      while (true) {
        try {
          umbilical.commitPending(taskId, taskStatus);
          break;
        } catch (InterruptedException ie) {
          // ignore
        } catch (IOException ie) {
          LOG.warn("Failure sending commit pending: " + 
                    StringUtils.stringifyException(ie));
          if (--retries == 0) {
            System.exit(67);
          }
        }
      }
{code}

The loop has been iterated twice, because the following exception happens:

{code}
2014-06-09 09:15:38,539 WARN [main] org.apache.hadoop.mapred.Task: Failure sending commit
pending: java.io.IOException: Failed on local exception: java.io.IOException: An existing
connection was forcibly closed by the remote host; Host Details : local host is: "sdevin-clster53/10.215.16.72";
destination host is: "sdevin-clster54":63415; 
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
	at org.apache.hadoop.ipc.Client.call(Client.java:1414)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
	at com.sun.proxy.$Proxy9.commitPending(Unknown Source)
	at org.apache.hadoop.mapred.Task.done(Task.java:1006)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:397)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
	at sun.nio.ch.SocketDispatcher.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:225)
	at sun.nio.ch.IOUtil.read(IOUtil.java:198)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:510)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1054)
	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:949)
{code}

Throughout the log, I can see a lot of "IOException: An existing connection was forcibly closed
by the remote host" (This issue seems to have been raised and solved in HADOOP-317, but it
still occurs on Windows), and this particular exception instance results in re-sending committing
message to AM.

In addition, it seem to have a logic problem about re-sending the committing message. Given
the service side has done everything, but IOException happens on the response. The second
should not to be sent by the client, or should be ignored by the service. This is something
we need to fix at the MR side. There're two options to fix the invalid transition:

1. At the state machine, we ignore TA_COMMIT_PENDING when a task attempt is at COMMIT_PENDING.
This is just a one-line fix.

2. We can add RetryCache for TaskAttemptListenerImpl, when commitPending is called, we check
whether we have cache for a particular attempt Id. If we have, it means the transition has
been executed before, and we can skip the transition.





> Windows: Sort Job failed due to 'Invalid event: TA_COMMIT_PENDING at COMMIT_PENDING'
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5924
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5924
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>
> Post the issue on behalf of [~yeshavora]:
> The Sort job over 1GB data failed with below error
> {code}
> 2014-06-09 09:15:38,746 INFO [Socket Reader #1 for port 63415] SecurityLogger.org.apache.hadoop.ipc.Server:
Auth successful for job_1402304714683_0002 (auth:SIMPLE)
> 2014-06-09 09:15:38,750 INFO [IPC Server handler 13 on 63415] org.apache.hadoop.mapred.TaskAttemptListenerImpl:
Commit-pending state update from attempt_1402304714683_0002_r_000015_1000
> 2014-06-09 09:15:38,751 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
Can't handle this event at current state for attempt_1402304714683_0002_r_000015_1000
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_COMMIT_PENDING
at COMMIT_PENDING
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1058)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:145)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1271)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1263)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:722)
> 2014-06-09 09:15:38,753 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
job_1402304714683_0002Job Transitioned from RUNNING to ERROR
> {code}
> The JobHistory Url prints job state = ERROR



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message