Date: Wed, 8 Aug 2012 08:04:11 +0000 (UTC)
From: "Tsuyoshi OZAWA (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-4326) Resurrect RM Restart

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430935#comment-13430935 ]

Tsuyoshi OZAWA commented on MAPREDUCE-4326:
-------------------------------------------

> So there may not be the need to store any state as long as the RM can recover the current state of the cluster from the
NMs in a reasonable amount of time.

It's a good idea to avoid saving state that can be recovered without being stored. However, it is not yet clear whether that state can be recovered in a reasonable amount of time, so we need a prototype to find out.

> The only state that needs to be saved, as far as I can see, is the information about all jobs that are not yet completed.

I agree. I'll check whether the states of WIP (not yet completed) jobs are defined correctly.

> Also, the implementation seems to be doing blocking calls to ZK etc and will likely end up being a bottleneck on RM threads/perf if a lot of state information needs to be synced to stable store.

To avoid this bottleneck, I think the RM should have a dedicated thread for saving its state. The main thread can then hand save requests to that thread without blocking, for example through a queue. Using asynchronous APIs to save the state would also work, but the code could get complicated.

> Resurrect RM Restart
> ---------------------
>
>                 Key: MAPREDUCE-4326
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4326
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, resourcemanager
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Bikas Saha
>         Attachments: MR-4343.1.patch
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM refactor.
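A minimal sketch of the dedicated-thread idea from the comment, written in Python for brevity. Everything in it is hypothetical: `AsyncStateSaver`, the `write` callback (a stand-in for a blocking ZooKeeper or other stable-store call) and the key names are illustrative assumptions, not the RM's actual state-store API. The caller's `save()` only enqueues the update; the blocking write happens on a single background thread.

```python
import queue
import threading
import time


class AsyncStateSaver:
    """Hypothetical sketch: persist RM state updates on a dedicated thread."""

    _STOP = object()  # sentinel that tells the writer thread to exit

    def __init__(self, write):
        # `write(key, data)` is an assumed blocking store call (e.g. a ZK write wrapper).
        self._write = write
        self._queue = queue.Queue()  # unbounded, so put() never blocks the caller
        self.failures = []  # (key, exception) pairs for writes that raised
        self._thread = threading.Thread(target=self._run, name="rm-state-saver", daemon=True)
        self._thread.start()

    def save(self, key, data):
        # Called from the main/event thread: enqueue and return immediately.
        self._queue.put((key, data))

    def _run(self):
        # Single consumer: updates are written in FIFO order, one at a time.
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            key, data = item
            try:
                self._write(key, data)  # the only place that blocks on the store
            except Exception as exc:
                # Record the failure and keep the writer thread alive.
                self.failures.append((key, exc))

    def close(self):
        # The sentinel is queued after pending updates, so they are flushed first.
        self._queue.put(self._STOP)
        self._thread.join()


if __name__ == "__main__":
    store = {}

    def slow_write(key, data):
        time.sleep(0.05)  # simulate a slow round trip to the state store
        store[key] = data

    saver = AsyncStateSaver(slow_write)
    start = time.monotonic()
    for i in range(5):
        saver.save(f"/rmstore/app_{i}", b"RUNNING")
    # Enqueueing returns long before the five simulated writes complete.
    print(f"enqueued in {time.monotonic() - start:.4f}s")
    saver.close()  # wait for pending writes, then stop the thread
    print(sorted(store))
```

Because one thread drains a FIFO queue, updates reach the store in the order they were submitted. The trade-off is that `save()` returning does not mean the state is durable yet, so any step that must not proceed until its state is persisted would still need to wait for that particular write, for example on a per-request completion signal.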