Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Date: Thu, 29 Nov 2012 08:52:59 +0000 (UTC)
From: "Vinod Kumar Vavilapalli (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <119112318.39060.1354179179801.JavaMail.jiratomcat@arcas>
In-Reply-To: <543924702.12568.1353511925172.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (MAPREDUCE-4813) AM timing out during job commit

[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506328#comment-13506328 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4813:
----------------------------------------------------

Some comments on the patch:
- Similar to JobCommitFailedEvent, add an event class for JOB_COMMIT_COMPLETED.
- JobImpl.checkJobCompleteSuccess() and corresponding return variables should be renamed to mean checkIfJobReadyForCommit().
Similarly, checkJobForCompletion(job).
- For now, we may just be addressing MAPREDUCE-4815, but the same argument of the committer being arbitrary user code is valid for other calls like abortJob and setupJob too. We will need states capturing those calls, and we will need to put them on separate threads so that the dispatcher isn't blocked. We can do that later, but to be future-proof, let's move the committer thread to a top-level service a la TaskCleaner. We may even re-purpose TaskCleanerImpl for this. Scope the effort and split it as you see fit.
- Commit-thread interrupting and joining is only meaningful in the case of kill-during-commit, so let's move that code there. Also, earlier, we never supported kill-during-commit, but now we do, and the patch is putting a 60-second upper bound on commitJob() before abortJob(). Comparing this with 1.*, we do allow kill-during-commit there, as commit happens in a separate JVM. So interrupt and join seems fine; let's just put in a config so that we can tweak it if ever there is a need.
- The test looks good. Can you extend it to include kill-during-commit too? That will also validate that the dispatcher isn't blocked anymore because of a long commit.

> AM timing out during job commit
> -------------------------------
>
> Key: MAPREDUCE-4813
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: MAPREDUCE-4813.patch, MAPREDUCE-4813.patch
>
>
> The AM calls the output committer's {{commitJob}} method synchronously during JobImpl state transitions, which means the JobImpl write lock is held the entire time the job is being committed. Holding the write lock prevents the RM allocator thread from heartbeating to the RM.
Therefore if committing the job takes too long (e.g.: the job has tons of files to commit and/or the namenode is bogged down) then the AM appears to be unresponsive to the RM and the RM kills the AM attempt.
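
The approach discussed in the review comment (run commitJob() on its own thread so the caller is not blocked, and on kill-during-commit interrupt and join with a bounded wait before abortJob()) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MAPREDUCE-4813 patch: CommitThreadSketch, Committer, Outcome and cancelTimeoutMs are hypothetical names, Committer is a stand-in rather than Hadoop's OutputCommitter, and the real AM would report results through its event dispatcher rather than a shared field.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative only: runs a possibly slow commit off the caller's thread. */
final class CommitThreadSketch {

    /** Hypothetical stand-in for arbitrary user committer code. */
    public interface Committer {
        void commitJob() throws Exception;
        void abortJob() throws Exception;
    }

    /** Hypothetical outcomes; the real job state machine has its own states and events. */
    public enum Outcome { PENDING, COMMIT_COMPLETED, COMMIT_FAILED, KILLED_DURING_COMMIT }

    private final Committer committer;
    private final long cancelTimeoutMs; // assumed tunable bound, e.g. read from configuration
    private final AtomicReference<Outcome> outcome = new AtomicReference<>(Outcome.PENDING);
    private final Thread commitThread;

    CommitThreadSketch(Committer committer, long cancelTimeoutMs) {
        if (cancelTimeoutMs <= 0) {
            // Thread.join(0) would wait forever, which defeats the bound.
            throw new IllegalArgumentException("cancelTimeoutMs must be positive");
        }
        this.committer = committer;
        this.cancelTimeoutMs = cancelTimeoutMs;
        this.commitThread = new Thread(() -> {
            try {
                committer.commitJob();
                // Only record success if nobody killed the job in the meantime.
                outcome.compareAndSet(Outcome.PENDING, Outcome.COMMIT_COMPLETED);
            } catch (Exception e) {
                // Includes InterruptedException raised by a kill; the kill outcome wins.
                outcome.compareAndSet(Outcome.PENDING, Outcome.COMMIT_FAILED);
            }
        }, "job-commit");
        this.commitThread.setDaemon(true);
    }

    /** Returns immediately, so the calling (dispatcher-like) thread is not blocked. */
    void startCommit() {
        commitThread.start();
    }

    Outcome outcome() {
        return outcome.get();
    }

    /** Blocks until the commit thread finishes; intended for tests. */
    Outcome awaitCommit() throws InterruptedException {
        commitThread.join();
        return outcome.get();
    }

    /** Kill-during-commit: interrupt, wait at most cancelTimeoutMs, then abort. */
    Outcome killDuringCommit() throws Exception {
        if (outcome.compareAndSet(Outcome.PENDING, Outcome.KILLED_DURING_COMMIT)) {
            commitThread.interrupt();
            commitThread.join(cancelTimeoutMs);
            committer.abortJob();
        }
        return outcome.get();
    }
}
```

The interrupt-and-join logic lives only in killDuringCommit(), matching the suggestion that it is meaningful only for that case. A commit that finishes between the state check and the interrupt is still treated as killed here; a real implementation would need to decide how to handle that race.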