From: Arinto Murdopo
Date: Thu, 8 Nov 2012 19:38:14 +0100
Subject: Re: CONTAINER_FINISHED event when RMAppAttemptImpl is RECOVERING
To: yarn-dev@hadoop.apache.org, devaraj.k@huawei.com

We're using the default scheduler, which is CapacityScheduler according to
YarnConfiguration.java.

Arinto Murdopo
European Master in Distributed Computing (EMDC)
Universitat Politècnica de Catalunya · BarcelonaTech, Barcelona, Spain
KTH Royal Institute of Technology, Stockholm, Sweden
Phone: +46 725 548 759

On Mon, Nov 5, 2012 at 1:58 AM, Devaraj K wrote:

> Hi Arinto,
>
> Could you please confirm which scheduler is configured here?
>
> Thanks & Regards
> Devaraj K
>
> -----Original Message-----
> From: Arinto Murdopo [mailto:arinto@gmail.com]
> Sent: Sunday, November 04, 2012 11:46 AM
> To: yarn-dev@hadoop.apache.org
> Subject: Re: CONTAINER_FINISHED event when RMAppAttemptImpl is RECOVERING
>
> Hi Arun,
>
> Thanks for the prompt reply. We need to test it for our school project,
> which is scheduled to end in early December. So, we still need to continue.
>
> The YARN-128 discussion (https://issues.apache.org/jira/browse/YARN-128)
> mentions that Devaraj has successfully tested the RM resurrection. So in
> this case, how did you test it? Do you kill and resurrect the RM at a
> random time?
>
> We are doing the resurrection using the following steps:
>
> 1. Run example MR jobs (such as the Pi computation).
> 2. After the mapping and reducing process has started, we kill the RM
>    using Linux's kill command.
> 3. Then, we wait for 3 seconds before we resurrect it.
> 4. We noticed that the mapping process is able to continue, and the job
>    gets stuck when the mapping process reaches 100%. At that time the
>    reduce process is still at 0%.
>
> We also modified TestMRJobs.java to use ZKStore, and to use
> ResourceManagerWrapper to start and stop the ResourceManager.
>
> regards,
>
> Arinto Murdopo
> European Master in Distributed Computing (EMDC)
> Universitat Politècnica de Catalunya · BarcelonaTech, Barcelona, Spain
> KTH Royal Institute of Technology, Stockholm, Sweden
> Phone: +46 725 548 759
>
> On Sat, Nov 3, 2012 at 7:04 PM, Arun C Murthy wrote:
>
> > Arinto,
> >
> > Unfortunately, it's too early to try it yet. I'd wait a little longer
> > for it to stabilize - should be soon.
> >
> > Thanks for trying it and the feedback though! Much appreciated.
> >
> > Arun
> >
> > On Nov 3, 2012, at 6:55 AM, Arinto Murdopo wrote:
> >
> > > Hi all,
> > >
> > > We got this exception when we tried to resurrect the ResourceManager
> > > using ZKStore. We are using Hadoop version 2.0.2 Alpha RC2, with the
> > > patch from the YARN-128 issue
> > > (https://issues.apache.org/jira/browse/YARN-128).
> > >
> > > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at RECOVERING
> > >   at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
> > >   at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> > >   at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> > >   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:510)
> > >   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:83)
> > >   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:442)
> > >   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:423)
> > >   at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
> > >   at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> > >   at java.lang.Thread.run(Thread.java:662)
> > >
> > > Inspecting RMAppAttemptImpl, we noticed that the state transition
> > > doesn't handle CONTAINER_FINISHED event when it is in the RECOVERING
> > > state. So in this case, what is the correct transition to handle
> > > CONTAINER_FINISHED event when we are in RECOVERING state?
> > >
> > > regards,
> > >
> > > Arinto Murdopo
> > > European Master in Distributed Computing (EMDC)
> > > Universitat Politècnica de Catalunya · BarcelonaTech, Barcelona, Spain
> > > KTH Royal Institute of Technology, Stockholm, Sweden
> >
> > --
> > Arun C. Murthy
> > Hortonworks Inc.
> > http://hortonworks.com/
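
For readers following the thread: the "Invalid event: CONTAINER_FINISHED at RECOVERING" exception quoted above is what a table-driven state machine reports when no transition is registered for the current (state, event) pair. The sketch below is a toy model in plain Java, not Hadoop's StateMachineFactory API. The class RecoveringDemo, its nested enums, the RECOVERY_DONE event, and the self-loop transition are all hypothetical names and choices made for illustration. It shows one possible way to tolerate CONTAINER_FINISHED while RECOVERING (stay in RECOVERING); whether the real RMAppAttemptImpl should ignore, buffer, or replay the event is not settled in this thread.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only. Nothing here is Hadoop code; all names are hypothetical.
class RecoveringDemo {
    enum AttemptState { RECOVERING, RUNNING }
    enum AttemptEvent { CONTAINER_FINISHED, RECOVERY_DONE }

    // A minimal table-driven state machine: (state, event) -> next state.
    static final class ToyStateMachine {
        private final Map<AttemptState, Map<AttemptEvent, AttemptState>> table =
                new EnumMap<AttemptState, Map<AttemptEvent, AttemptState>>(AttemptState.class);
        private AttemptState current;

        ToyStateMachine(AttemptState initial) {
            this.current = initial;
        }

        // Registers a transition and returns this machine so calls can be chained.
        ToyStateMachine addTransition(AttemptState from, AttemptEvent on, AttemptState to) {
            Map<AttemptEvent, AttemptState> row = table.get(from);
            if (row == null) {
                row = new EnumMap<AttemptEvent, AttemptState>(AttemptEvent.class);
                table.put(from, row);
            }
            row.put(on, to);
            return this;
        }

        // Applies an event; rejects any (state, event) pair that was never registered.
        AttemptState handle(AttemptEvent event) {
            Map<AttemptEvent, AttemptState> row = table.get(current);
            AttemptState next = (row == null) ? null : row.get(event);
            if (next == null) {
                throw new IllegalStateException("Invalid event: " + event + " at " + current);
            }
            current = next;
            return current;
        }
    }

    public static void main(String[] args) {
        // No transition registered for CONTAINER_FINISHED at RECOVERING: the event is rejected.
        ToyStateMachine strict = new ToyStateMachine(AttemptState.RECOVERING)
                .addTransition(AttemptState.RECOVERING, AttemptEvent.RECOVERY_DONE, AttemptState.RUNNING);
        try {
            strict.handle(AttemptEvent.CONTAINER_FINISHED);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Assumed policy for illustration: a self-loop keeps the attempt in RECOVERING.
        ToyStateMachine tolerant = new ToyStateMachine(AttemptState.RECOVERING)
                .addTransition(AttemptState.RECOVERING, AttemptEvent.RECOVERY_DONE, AttemptState.RUNNING)
                .addTransition(AttemptState.RECOVERING, AttemptEvent.CONTAINER_FINISHED, AttemptState.RECOVERING);
        System.out.println(tolerant.handle(AttemptEvent.CONTAINER_FINISHED));
        System.out.println(tolerant.handle(AttemptEvent.RECOVERY_DONE));
    }
}
```

A self-loop is the simplest option to show, but it silently drops the information that a container finished. A design that records the event and replays it once recovery completes would need extra state that this sketch does not model.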