hadoop-mapreduce-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4774) JobImpl does not handle asynchronous task events in FAILED state
Date Fri, 09 Nov 2012 22:49:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494389#comment-13494389 ]

Hadoop QA commented on MAPREDUCE-4774:
--------------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12552903/MAPREDUCE-4774.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified
test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of
javac compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version
1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number
of release audit warnings.

    {color:red}-1 core tests{color}.  The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app:

                  org.apache.hadoop.mapreduce.v2.app.TestRecovery

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3006//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3006//console

This message is automatically generated.
                
> JobImpl does not handle asynchronous task events in FAILED state
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4774
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Ivan A. Veselovsky
>            Assignee: Jason Lowe
>         Attachments: MAPREDUCE-4774.patch
>
>
> The test org.apache.hadoop.mapred.TestClusterMRNotification.testMR frequently fails
> in the mapred build (e.g. see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2988/testReport/junit/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/
> or https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2982//testReport/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/).
> The test checks the job status notifications received through an HTTP servlet. It runs
> 3 jobs: one successful, one killed, and one failed.
> The test expects the servlet to receive specific notifications in a specific
> order. It also exercises the retry-on-failure notification functionality: on each 1st
> notification attempt the servlet answers "400 forcing error", and on each 2nd attempt
> it answers "ok".
> In general, the test fails because the actual number and/or type of notifications
> differs from what is expected.
> Investigation shows that the actual root cause of the problem is an incorrect job state transition:
> the mapred task of the 3rd job fails (due to an intentionally thrown RuntimeException, see UtilsForTests#runJobFail()),
> and the state of the task changes from RUNNING to FAILED.
> At this point a JobEventType.JOB_TASK_ATTEMPT_COMPLETED event is submitted (in method
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handleTaskAttemptCompletion(TaskAttemptId,
> TaskAttemptCompletionEventStatus)), and this event gets processed in AsyncDispatcher, but
> this transition is not allowed by the event transition map (see org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl#stateMachineFactory).
> This causes the following exception to be thrown when the event is processed:
> 2012-11-06 12:22:02,335 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:309)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$3(StateMachineFactory.java:290)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:454)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:716)
>         at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:917)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:79)
>         at java.lang.Thread.run(Thread.java:662) 
> So the job enters the "INTERNAL_ERROR" state, and a job end notification like this is sent:
> http://localhost:48656/notification/mapred?jobId=job_1352199715842_0002&jobStatus=ERROR
> (here we can see the "ERROR" status instead of "FAILED")
> After that the notification servlet receives either only an "ERROR" notification, or one
> more "ERROR" notification after "FAILED", which ultimately causes the test to fail. (The test
> behavior varies because of race conditions, since a lot of asynchronous processing is
> involved; the test is, in fact, flaky.)
> In any case, it looks like the root cause of the problem is that the forbidden
> transition "Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED" can occur.
> Need expert advice on how this should be fixed.
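
The failure mode described above can be illustrated with a small, self-contained sketch. It is not Hadoop code and is not necessarily what the attached patch does: the class JobStateSketch, the event JOB_FAIL_TRIGGER, and the flag ignoreLateEventsInFailed are hypothetical names introduced only for illustration. It models a transition table in which an event with no registered transition pushes the job into INTERNAL_ERROR, and shows how registering a self-transition for late events in the FAILED state would keep the job in FAILED.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model of a table-driven job state machine (hypothetical, not the Hadoop API).
public class JobStateSketch {
    enum State { RUNNING, FAILED, INTERNAL_ERROR }

    // JOB_TASK_ATTEMPT_COMPLETED comes from the report; JOB_FAIL_TRIGGER is a
    // made-up stand-in for whatever moves the job from RUNNING to FAILED.
    enum Event { JOB_TASK_ATTEMPT_COMPLETED, JOB_FAIL_TRIGGER }

    // For each state: the events it accepts and the state each one leads to.
    private final Map<State, Map<Event, State>> transitions = new EnumMap<>(State.class);
    private State current = State.RUNNING;

    JobStateSketch(boolean ignoreLateEventsInFailed) {
        for (State s : State.values()) {
            transitions.put(s, new EnumMap<>(Event.class));
        }
        transitions.get(State.RUNNING).put(Event.JOB_TASK_ATTEMPT_COMPLETED, State.RUNNING);
        transitions.get(State.RUNNING).put(Event.JOB_FAIL_TRIGGER, State.FAILED);
        if (ignoreLateEventsInFailed) {
            // Assumed remedy: late asynchronous events leave a FAILED job in FAILED.
            transitions.get(State.FAILED).put(Event.JOB_TASK_ATTEMPT_COMPLETED, State.FAILED);
        }
    }

    // Applies an event; an event with no registered transition is treated as an
    // internal error, mirroring the behavior described in the report.
    State handle(Event event) {
        State next = transitions.get(current).get(event);
        current = (next != null) ? next : State.INTERNAL_ERROR;
        return current;
    }

    public static void main(String[] args) {
        for (boolean ignore : new boolean[] {false, true}) {
            JobStateSketch job = new JobStateSketch(ignore);
            job.handle(Event.JOB_FAIL_TRIGGER);                          // job fails first
            State end = job.handle(Event.JOB_TASK_ATTEMPT_COMPLETED);    // late event arrives
            System.out.println("ignoreLateEventsInFailed=" + ignore + " -> " + end);
        }
    }
}
```

With the flag off, the late event ends in INTERNAL_ERROR, which corresponds to the "ERROR" notification seen by the servlet; with it on, the job stays in FAILED.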

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
