Date: Tue, 6 Nov 2012 11:30:12 +0000 (UTC)
From: "Ivan A. Veselovsky (JIRA)"
To: mapreduce-dev@hadoop.apache.org
Message-ID: <330265410.74038.1352201412277.JavaMail.jiratomcat@arcas>
Subject: [jira] [Created] (MAPREDUCE-4774) repair test org.apache.hadoop.mapred.TestClusterMRNotification.testMR

Ivan A. Veselovsky created MAPREDUCE-4774:
---------------------------------------------

             Summary: repair test org.apache.hadoop.mapred.TestClusterMRNotification.testMR
                 Key: MAPREDUCE-4774
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4774
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Ivan A. Veselovsky

The test org.apache.hadoop.mapred.TestClusterMRNotification.testMR frequently fails in mapred build (e.g.
see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2988/testReport/junit/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/ , or https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2982//testReport/org.apache.hadoop.mapred/TestClusterMRNotification/testMR/).

The test checks the job status notifications received through an HTTP servlet. It runs 3 jobs: successful, killed, and failed. The test expects the servlet to receive a specific set of notifications in a specific order. It also tries to test the retry-on-failure notification functionality, so on each 1st notification attempt the servlet answers "400 forcing error", and on each 2nd attempt it answers "ok".

In general, the test fails because the actual number and/or type of the notifications differs from what is expected. Investigation shows that the actual root cause of the problem is an incorrect job state transition: the mapred task of the 3rd job fails (because of an intentionally thrown RuntimeException, see UtilsForTests#runJobFail()), and the state of the task changes from RUNNING to FAILED. At this point a JobEventType.JOB_TASK_ATTEMPT_COMPLETED event is submitted (in method org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handleTaskAttemptCompletion(TaskAttemptId, TaskAttemptCompletionEventStatus)), and this event gets processed in AsyncDispatcher, but this transition is impossible according to the event transition map (see org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl#stateMachineFactory).
This causes the following exception to be thrown when the event is processed:

2012-11-06 12:22:02,335 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:309)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$3(StateMachineFactory.java:290)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:454)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:716)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:917)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:79)
        at java.lang.Thread.run(Thread.java:662)

So the job gets into state "INTERNAL_ERROR", and a job end notification like this is sent:

http://localhost:48656/notification/mapred?jobId=job_1352199715842_0002&jobStatus=ERROR

(here we can see the "ERROR" status instead of "FAILED").

After that the notification servlet receives either only the "ERROR" notification, or one more "ERROR" notification after "FAILED", which finally causes the test to fail. (The test behavior varies because of race conditions: there is a lot of asynchronous processing involved, so the test is in fact flaky.)

In any case, it looks like the root cause of the problem is that the forbidden transition "Invalid event: JOB_TASK_ATTEMPT_COMPLETED at FAILED" is possible. Expert advice is needed on how this should be fixed.
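
To make the failure mode above easier to discuss, here is a minimal sketch of a table-driven state machine that behaves the way the report describes: an event with no registered transition for the current state drops the job into an error state. This is a toy model, not the org.apache.hadoop.yarn.state.StateMachineFactory API; the class JobStateSketch, its State and Event enums, and the ignoreLateCompletionEvents flag are all hypothetical names. The self-transition on FAILED is only one possible mitigation and is an assumption, not a confirmed fix.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy model of a job state machine; NOT the Hadoop StateMachineFactory API. */
final class JobStateSketch {
    enum State { RUNNING, FAILED, ERROR }
    enum Event { JOB_TASK_FAILED, JOB_TASK_ATTEMPT_COMPLETED }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> transitions = new EnumMap<>(State.class);
    private State state = State.RUNNING;

    JobStateSketch(boolean ignoreLateCompletionEvents) {
        for (State s : State.values()) {
            transitions.put(s, new EnumMap<>(Event.class));
        }
        transitions.get(State.RUNNING).put(Event.JOB_TASK_FAILED, State.FAILED);
        transitions.get(State.RUNNING).put(Event.JOB_TASK_ATTEMPT_COMPLETED, State.RUNNING);
        if (ignoreLateCompletionEvents) {
            // Hypothetical mitigation (an assumption): a late completion event
            // that arrives after the job has failed is accepted and ignored.
            transitions.get(State.FAILED).put(Event.JOB_TASK_ATTEMPT_COMPLETED, State.FAILED);
        }
    }

    /** Applies the event; an unregistered (state, event) pair ends in ERROR. */
    State handle(Event event) {
        State next = transitions.get(state).get(event);
        state = (next != null) ? next : State.ERROR;
        return state;
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        // Without the extra transition, the late event turns FAILED into ERROR.
        JobStateSketch strict = new JobStateSketch(false);
        strict.handle(Event.JOB_TASK_FAILED);
        System.out.println(strict.handle(Event.JOB_TASK_ATTEMPT_COMPLETED)); // prints ERROR

        // With the self-transition, the job stays FAILED.
        JobStateSketch lenient = new JobStateSketch(true);
        lenient.handle(Event.JOB_TASK_FAILED);
        System.out.println(lenient.handle(Event.JOB_TASK_ATTEMPT_COMPLETED)); // prints FAILED
    }
}
```

In the strict variant the final state is ERROR, which corresponds to the jobStatus=ERROR notification seen by the servlet; in the lenient variant the state stays FAILED. Whether ignoring the event is actually the right behavior for JobImpl is exactly the open question here.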

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira