Subject: Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager
From: Xuan Gong <xgong@hortonworks.com>
To: user@hadoop.apache.org
Cc: "Arthur.hk.chan@gmail.com" <arthur.hk.chan@gmail.com>
Date: Mon, 11 Aug 2014 22:13:56 -0700

Some questions:

Q1) I need to start YARN on EACH master separately; is this normal? Is there a way that I can just run ./sbin/start-yarn.sh on rm1 and get the STANDBY ResourceManager on rm2 started as well?

No, you need to start the multiple RMs separately. ./sbin/start-yarn.sh starts the ResourceManager only on the node where you run it, so the standby RM on rm2 must be brought up by hand.
Q2) How do I get alerts (e.g. by email) if the ACTIVE ResourceManager goes down in an auto-failover environment? How do you monitor the status of the ACTIVE/STANDBY ResourceManager?

Interesting question. One of the design goals of auto-failover is that RM downtime is invisible to end users: they can keep submitting applications normally even while a failover happens.

We can monitor the status of the RMs from the command line (yarn rmadmin -getServiceState, as you did previously) or from the web UI / web service (rm_address:portnumber/cluster/cluster) and read the current status there.
Thanks

Xuan Gong

On Mon, Aug 11, 2014 at 5:12 PM, Arthur.hk.chan@gmail.com <arthur.hk.chan@gmail.com> wrote:

> Hi,
>
> It is a multiple-node cluster with two master nodes (rm1 and rm2); below is my yarn-site.xml.
>
> At the moment, the ResourceManager HA works if:
>
> 1) at rm1, run ./sbin/start-yarn.sh
>
> yarn rmadmin -getServiceState rm1
> active
>
> yarn rmadmin -getServiceState rm2
> 14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server: rm1/192.168.1.1:23142. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>
> 2) at rm2, run ./sbin/start-yarn.sh
>
> yarn rmadmin -getServiceState rm1
> standby
>
> Some questions:
> Q1) I need to start YARN on EACH master separately; is this normal? Is there a way that I can just run ./sbin/start-yarn.sh on rm1 and get the STANDBY ResourceManager on rm2 started as well?
>
> Q2) How do I get alerts (e.g. by email) if the ACTIVE ResourceManager goes down in an auto-failover environment? How do you monitor the status of the ACTIVE/STANDBY ResourceManager?
>
> Regards
> Arthur
>
> <configuration>
>
>   <!-- Site specific YARN configuration properties -->
>
>   <property>
>     <name>yarn.nodemanager.aux-services</name>
>     <value>mapreduce_shuffle</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.address</name>
>     <value>192.168.1.1:8032</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address</name>
>     <value>192.168.1.1:8031</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.admin.address</name>
>     <value>192.168.1.1:8033</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address</name>
>     <value>192.168.1.1:8030</value>
>   </property>
>
>   <property>
>     <name>yarn.nodemanager.local-dirs</name>
>     <value>/edh/hadoop_data/mapred/nodemanager</value>
>     <final>true</final>
>   </property>
>
>   <property>
>     <name>yarn.web-proxy.address</name>
>     <value>192.168.1.1:8888</value>
>   </property>
>
>   <property>
>     <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
>     <value>org.apache.hadoop.mapred.ShuffleHandler</value>
>   </property>
>
>   <property>
>     <name>yarn.nodemanager.resource.memory-mb</name>
>     <value>18432</value>
>   </property>
>
>   <property>
>     <name>yarn.scheduler.minimum-allocation-mb</name>
>     <value>9216</value>
>   </property>
>
>   <property>
>     <name>yarn.scheduler.maximum-allocation-mb</name>
>     <value>18432</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.connect.retry-interval.ms</name>
>     <value>2000</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.ha.enabled</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.cluster-id</name>
>     <value>cluster_rm</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.ha.rm-ids</name>
>     <value>rm1,rm2</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm1</name>
>     <value>192.168.1.1</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm2</name>
>     <value>192.168.1.2</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.scheduler.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.recovery.enabled</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.store.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
>   </property>
>
>   <property>
>     <name>yarn.resourcemanager.zk-address</name>
>     <value>rm1:2181,m135:2181,m137:2181</value>
>   </property>
>
>   <property>
>     <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
>     <value>5000</value>
>   </property>
>
>   <!-- RM1 configs -->
>   <property>
>     <name>yarn.resourcemanager.address.rm1</name>
>     <value>192.168.1.1:23140</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address.rm1</name>
>     <value>192.168.1.1:23130</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.https.address.rm1</name>
>     <value>192.168.1.1:23189</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.address.rm1</name>
>     <value>192.168.1.1:23188</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
>     <value>192.168.1.1:23125</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.admin.address.rm1</name>
>     <value>192.168.1.1:23142</value>
>   </property>
>
>   <!-- RM2 configs -->
>   <property>
>     <name>yarn.resourcemanager.address.rm2</name>
>     <value>192.168.1.2:23140</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address.rm2</name>
>     <value>192.168.1.2:23130</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.https.address.rm2</name>
>     <value>192.168.1.2:23189</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.address.rm2</name>
>     <value>192.168.1.2:23188</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
>     <value>192.168.1.2:23125</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.admin.address.rm2</name>
>     <value>192.168.1.2:23142</value>
>   </property>
>
>   <property>
>     <name>yarn.nodemanager.remote-app-log-dir</name>
>     <value>/edh/hadoop_logs/hadoop/</value>
>   </property>
>
> </configuration>
>
> On 12 Aug, 2014, at 1:49 am, Xuan Gong <xgong@hortonworks.com> wrote:
>
> Hey, Arthur:
>
> Did you use a single-node cluster or a multiple-node cluster? Could you share your configuration file (yarn-site.xml)? This looks like a configuration issue.
>
> Thanks
>
> Xuan Gong
>
> On Mon, Aug 11, 2014 at 9:45 AM, Arthur.hk.chan@gmail.com <arthur.hk.chan@gmail.com> wrote:
>
>> Hi,
>>
>> If I have TWO nodes for ResourceManager HA, what should be the correct steps and commands to start and stop the ResourceManager in a ResourceManager HA cluster?
>> Unlike ./sbin/start-dfs.sh (which can start all NNs from one NN), it seems that ./sbin/start-yarn.sh can only start YARN on one node at a time.
>>
>> Regards
>> Arthur