hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4457) mr job invalid transition TA_TOO_MANY_FETCH_FAILURE at FAILED
Date Tue, 31 Jul 2012 16:17:35 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425885#comment-13425885 ]

Robert Joseph Evans commented on MAPREDUCE-4457:
------------------------------------------------

When a Job gets a JOB_TASK_ATTEMPT_FETCH_FAILURE event, it checks whether the failure
percentage is too high.  If 50% or more of the running reducers are complaining about a map
attempt, and at least 3 reducer attempts have tried to get the data and failed, it sends a
TA_TOO_MANY_FETCH_FAILURE event to the map task attempt.  The JobImpl stores the fetch
failure count for each map attempt as it sees the failures.  If the failures get high enough
to send the TA_TOO_MANY_FETCH_FAILURE event, the JobImpl then deletes the failure state for
the map attempt.

The only scenario in which I can see this happening is a derivative of the following.  Assume
we have a single map task, and 6 reduce tasks.  All six reduce tasks try to fetch from the
map task at almost exactly the same time.  They all fail and report back that they failed.
After the first three report back, the failures amount to exactly 50% of the running
reducers and the count is 3 or more, so the TA_TOO_MANY_FETCH_FAILURE event is sent and the
state is reset.  The next three reducer fetch failures are then processed, and we again have
3 failures, which is again exactly 50% and 3 or more total failures, resulting in another
TA_TOO_MANY_FETCH_FAILURE event being sent.
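
As a rough illustration of the scenario above, here is a minimal sketch of the counting
logic as described in this comment. It is not the actual JobImpl code: the class, field,
and method names are hypothetical, and only the two thresholds (50% of running reducers, at
least 3 failures) and the reset-after-send behavior are taken from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the fetch-failure bookkeeping described above.
class FetchFailureSketch {
    // Thresholds taken from the description; the constant names are assumptions.
    static final double MAX_FAILURE_FRACTION = 0.5;
    static final int MIN_FAILURE_REPORTS = 3;

    private final Map<String, Integer> fetchFailures = new HashMap<>();
    private final int runningReducers;
    int eventsSent = 0;

    FetchFailureSketch(int runningReducers) {
        this.runningReducers = runningReducers;
    }

    // Called once per reducer fetch-failure report against a map attempt.
    void onFetchFailure(String mapAttemptId) {
        int failures = fetchFailures.merge(mapAttemptId, 1, Integer::sum);
        double fraction = (double) failures / runningReducers;
        if (failures >= MIN_FAILURE_REPORTS && fraction >= MAX_FAILURE_FRACTION) {
            eventsSent++;                        // stands in for TA_TOO_MANY_FETCH_FAILURE
            fetchFailures.remove(mapAttemptId);  // failure state is reset after sending
        }
    }

    // Feeds `reports` failure reports for one map attempt and counts the events sent.
    static int simulate(int reducers, int reports) {
        FetchFailureSketch job = new FetchFailureSketch(reducers);
        for (int i = 0; i < reports; i++) {
            job.onFetchFailure("map_attempt_0");
        }
        return job.eventsSent;
    }

    public static void main(String[] args) {
        System.out.println(simulate(6, 6)); // prints 2
    }
}
```

With 6 reducers and 6 back-to-back reports, the third and sixth reports each cross both
thresholds, so the sketch sends the event twice to the same map attempt.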

There may be other situations in which the logic gets confused and sends the event twice,
although I cannot think of any.  We could either update the JobImpl to not send the event
twice, or we could update the TaskAttemptImpl to handle getting it twice.
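
The two options could look roughly like the sketch below. Everything in it is hypothetical
(the class names, the reduced three-state enum, and the method names) and it does not use
the real JobImpl, TaskAttemptImpl, or YARN state machine classes. It only shows the shape of
a sender-side guard versus a receiver that tolerates a duplicate.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the two possible fixes; not Hadoop code.
class DuplicateEventSketch {
    // Reduced set of task attempt states, for illustration only.
    enum State { RUNNING, SUCCEEDED, FAILED }

    // Option 1: the sender remembers which map attempts it has already notified.
    static class Sender {
        private final Set<String> alreadyNotified = new HashSet<>();

        // Returns true only the first time a given attempt id is seen.
        boolean shouldSend(String mapAttemptId) {
            return alreadyNotified.add(mapAttemptId);
        }
    }

    // Option 2: the receiver treats a repeated event at FAILED as a no-op.
    static State onTooManyFetchFailures(State current) {
        switch (current) {
            case SUCCEEDED:
                return State.FAILED;   // first event: the map output is considered lost
            case FAILED:
                return State.FAILED;   // duplicate event: ignore instead of erroring
            default:
                throw new IllegalStateException("Invalid event at " + current);
        }
    }

    public static void main(String[] args) {
        Sender sender = new Sender();
        System.out.println(sender.shouldSend("map_attempt_0"));
        System.out.println(sender.shouldSend("map_attempt_0"));
        System.out.println(onTooManyFetchFailures(onTooManyFetchFailures(State.SUCCEEDED)));
    }
}
```

The sender-side guard keeps the duplicate from ever being dispatched, while the
receiver-side change also covers any other path that might deliver the event twice.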

                
> mr job invalid transition TA_TOO_MANY_FETCH_FAILURE at FAILED
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-4457
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4457
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Thomas Graves
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> We saw a job go into the ERROR state from an invalid state transition.
> 3,600 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1342238829791_2501_m_007743_0 TaskAttempt Transitioned from SUCCEEDED
> to FAILED
> 2012-07-16 08:49:53,600 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1342238829791_2501_m_008850_0 TaskAttempt Transitioned from SUCCEEDED
> to FAILED
> 2012-07-16 08:49:53,600 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1342238829791_2501_m_017344_1000 TaskAttempt Transitioned from RUNNING
> to SUCCESS_CONTAINER_CLEANUP
> 2012-07-16 08:49:53,601 ERROR [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle this
> event at current state for attempt_1342238829791_2501_m_000027_0
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
> TA_TOO_MANY_FETCH_FAILURE at FAILED
>     at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
>     at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>     at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>     at
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:954)
>     at
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:133)
>     at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:913)
>     at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:905)
>     at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>     at
> org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.realDispatch(RecoveryService.java:285)
>     at
> org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:281)
>     at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>     at java.lang.Thread.run(Thread.java:619)
> 2012-07-16 08:49:53,601 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1342238829791_2501_m_029091_1000 TaskAttempt Transitioned from RUNNING
> to SUCCESS_CONTAINER_CLEANUP
> 2012-07-16 08:49:53,601 INFO [IPC Server handler 17 on 47153]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
> attempt_1342238829791_2501_r_000461_1000
> It looks like we possibly got 2 TA_TOO_MANY_FETCH_FAILURE events. The first one moved
> it to FAILED, and then the second one failed because there was no valid transition.


        
