hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4233) NPE can happen in RMNMNodeInfo.
Date Tue, 08 May 2012 19:49:49 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270766#comment-13270766 ]

Robert Joseph Evans commented on MAPREDUCE-4233:
------------------------------------------------

@Bikas,

What we are seeing on one of our clusters is not an intermittent race.  The cluster
is currently stuck in this state, although a race is also possible.

I was curious which situations cause the scheduler's node list to be updated.  It is
updated whenever a NODE_ADDED or NODE_REMOVED event is sent to the scheduler.  NODE_ADDED
events are sent when a node registers for the first time, when a node reconnects, and when
a node's status transitions from unhealthy to healthy.  Similarly, a NODE_REMOVED event is
sent when a node transitions from healthy to unhealthy, when the node is deactivated, or
when a node reconnects (it is removed and then added back in).  From that, it appears that
the scheduler is intended to store only the list of healthy nodes.  By contrast, the list
of nodes in the RM context is updated when nodes register, reconnect, or deactivate.  The
difference between the two is the unhealthy/healthy transitions.
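
To make the divergence concrete, here is a toy model of the two lists.  It is not Hadoop
code: the class, its fields, and its method names are all hypothetical, and reconnects are
left out for brevity.  It only illustrates that a health transition changes the
scheduler's list while the context's list keeps the node.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; every name here is hypothetical, not part of YARN.
class NodeListsSketch {
    // Stands in for the RM context's node list (all known nodes).
    final Map<String, String> contextNodes = new HashMap<>();
    // Stands in for the scheduler's node list (healthy nodes only).
    final Map<String, String> schedulerNodes = new HashMap<>();

    // First registration: both lists learn about the node.
    void register(String nodeId) {
        contextNodes.put(nodeId, "RUNNING");
        schedulerNodes.put(nodeId, "RUNNING"); // models NODE_ADDED
    }

    // Healthy -> unhealthy: the context keeps the node, the scheduler drops it.
    void markUnhealthy(String nodeId) {
        if (contextNodes.containsKey(nodeId)) {
            contextNodes.put(nodeId, "UNHEALTHY");
            schedulerNodes.remove(nodeId); // models NODE_REMOVED
        }
    }

    // Unhealthy -> healthy: the scheduler gets the node back.
    void markHealthy(String nodeId) {
        if (contextNodes.containsKey(nodeId)) {
            contextNodes.put(nodeId, "RUNNING");
            schedulerNodes.put(nodeId, "RUNNING"); // models NODE_ADDED
        }
    }

    // Deactivation: both lists forget the node.
    void deactivate(String nodeId) {
        contextNodes.remove(nodeId);
        schedulerNodes.remove(nodeId); // models NODE_REMOVED
    }

    public static void main(String[] args) {
        NodeListsSketch lists = new NodeListsSketch();
        lists.register("nm1");
        lists.markUnhealthy("nm1");
        // The context still knows nm1, but the scheduler no longer does.
        System.out.println("context:   " + lists.contextNodes.keySet());
        System.out.println("scheduler: " + lists.schedulerNodes.keySet());
    }
}
```

In this model, any code that walks the context's list and asks the scheduler about each
node has to expect a missing entry for unhealthy nodes.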

I did not dig much further to see whether there is more of a disconnect between the two
lists, mainly because the other places that access either of these node lists check for
null return values, so I assumed that this was simply a missed null check here as well.
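
As a rough sketch of the kind of null check I mean (this is not the attached patch; the
class, the Report type, and the summarize method are hypothetical stand-ins, and reporting
"unknown" rather than skipping the node is an assumption made for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only; not the real RMNMInfo code.
class LiveNodeSummary {
    // Hypothetical stand-in for a scheduler node report.
    static final class Report {
        final int numContainers;
        Report(int numContainers) { this.numContainers = numContainers; }
    }

    // Walks the context's nodes and looks each one up in the scheduler's reports.
    // A missing report is tolerated instead of being dereferenced.
    static List<String> summarize(Map<String, String> contextNodes,
                                  Map<String, Report> schedulerReports) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> node : contextNodes.entrySet()) {
            String prefix = node.getKey() + " state=" + node.getValue();
            Report report = schedulerReports.get(node.getKey());
            if (report == null) {
                // The scheduler does not track this node (for example, it is unhealthy).
                lines.add(prefix + " containers=unknown");
                continue;
            }
            lines.add(prefix + " containers=" + report.numContainers);
        }
        return lines;
    }
}
```

Whether a node without a report should be listed with partial information or left out
entirely is a separate decision; the point is only that the lookup result must be checked
before it is used.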
                
> NPE can happen in RMNMNodeInfo.
> -------------------------------
>
>                 Key: MAPREDUCE-4233
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4233
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4233.txt
>
>
> {noformat}
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo.getLiveNodeManagers(RMNMInfo.java:96)
>         at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
>         at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
>         at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
>         at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:65)
>         at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:216)
>         at javax.management.StandardMBean.getAttribute(StandardMBean.java:358)
>         at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:666)
> {noformat}
> Looks like rmcontext.getRMNodes() is not kept in sync with scheduler.getNodeReport(),
> so the report can be null even though the context still knows about the node.
> The simple fix is to add in a null check.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
