hadoop-mapreduce-user mailing list archives

From Matt Narrell <matt.narr...@gmail.com>
Subject Re: YARN HA Active ResourceManager failover when machine is stopped
Date Mon, 27 Apr 2015 20:43:13 GMT
Yes, it looks like we’re running up against YARN-2578.  That’s very unfortunate.

Thanks for everyone’s investigation and input.

mn

> On Apr 26, 2015, at 10:38 PM, Rohith Sharma K S <rohithsharmaks@huawei.com> wrote:
> 
> Hi
>  
>      I have seen this issue in my cluster without HA configured, when the process is halted. I assume your scenario has a similar issue when the Active RM machine is shut down abruptly. Maybe you can verify this by taking a thread dump of the NM and comparing it with the JIRAs below.
>  
> Open JIRAs in the community regarding this problem are:
> https://issues.apache.org/jira/i#browse/YARN-1061 (without HA)
> https://issues.apache.org/jira/i#browse/YARN-2578 (with HA)
>  
>  
> Thanks & Regards
> Rohith Sharma K S
>  
> From: Matt Narrell [mailto:matt.narrell@gmail.com] 
> Sent: 24 April 2015 23:28
> To: user@hadoop.apache.org
> Subject: Re: YARN HA Active ResourceManager failover when machine is stopped
>  
> Also, another observation: when the VMs are halted, it seems the NodeManagers do not treat this as a scenario in which to round-robin among the configured ResourceManagers. Is there some timeout I've missed that instructs the NodeManagers to round-robin when the machine is not responding (to distinguish it from a network blip)?
>  
> mn
>  
> On Apr 24, 2015, at 1:50 AM, Drake민영근 <drake.min@nexr.com> wrote:
>  
> Hi, Matt
>  
> The second log file looks like a NodeManager log, not the standby ResourceManager's.
>  
> Thanks.
> 
> Drake 민영근 Ph.D
> kt NexR
>  
> On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <matt.narrell@gmail.com> wrote:
> Active ResourceManager:  http://pastebin.com/hE0ppmnb
> Standby ResourceManager: http://pastebin.com/DB8VjHqA
>  
> Oppressively chatty and not much valuable info contained therein.
>  
>  
> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <vinodkv@hortonworks.com> wrote:
>  
> I have run into this offline with someone else too but couldn't root-cause it.
>  
> Will you be able to share your active/standby ResourceManager logs via pastebin or something?
>  
> +Vinod
>  
> On Apr 23, 2015, at 9:41 AM, Matt Narrell <matt.narrell@gmail.com> wrote:
> 
> 
> I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>  
> I’m testing the YARN HA ResourceManager failover. If I STOP the active ResourceManager
(shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers
do not register themselves with the newly elected active ResourceManager. If I restart the
machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected
ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces
a SPOF, and is not HA in the sense I’m expecting.
>  
> Thanks,
> mn
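
The two cases in the original report (machine powered off versus machine up with the YARN services stopped) can be illustrated with a plain TCP probe. When the host is up but nothing is listening, the connection is refused immediately and a client can move on to the next ResourceManager. When the host is off, nothing answers, and a client without a timeout can wait a long time. Below is a minimal Python sketch of round-robin selection with a connect timeout. It is not how the NodeManager or the Hadoop RPC client is implemented. The host names rm1 and rm2, the 2-second timeout, and the function names are assumptions made for illustration; 8031 is the default resource-tracker port.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def tcp_probe(addr: Address, timeout_s: float) -> bool:
    """Return True if a TCP connection to addr succeeds within timeout_s."""
    try:
        with socket.create_connection(addr, timeout=timeout_s):
            return True
    except OSError:
        # Covers a refused connection (host up, service stopped), a timeout
        # (host silent, e.g. powered off), and name-resolution failures.
        return False


def pick_resource_manager(
    candidates: Iterable[Address],
    timeout_s: float = 2.0,  # assumed value, not a Hadoop default
    probe: Callable[[Address, float], bool] = tcp_probe,
) -> Optional[Address]:
    """Try each candidate in order and return the first one that answers."""
    for addr in candidates:
        if probe(addr, timeout_s):
            return addr
    return None  # no candidate reachable; the caller decides when to retry


if __name__ == "__main__":
    # Hypothetical ResourceManager hosts.
    rms = [("rm1", 8031), ("rm2", 8031)]
    print(pick_resource_manager(rms))
```

Without the timeout, a probe of a powered-off host would depend on the operating system's own connect timeout, which may be far longer than a failover test is willing to wait.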

