hadoop-yarn-issues mailing list archives

From "Kuhu Shukla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
Date Wed, 24 Feb 2016 16:38:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163294#comment-15163294 ]

Kuhu Shukla commented on YARN-4723:
-----------------------------------

The primary cause of this failure is the {{UnknownNodeId}} object. Even if we put this dummy
nodeId in inactiveRMNodes rather than in the active RMNodes, the transition from NEW to
DECOMMISSIONED marks the node unusable (NODE_UNUSABLE). That will trigger a NODE_UPDATE, which
in turn populates {{updatedNodes}} in the AllocateResponse, as the handler below shows.
{code}
  @Override
  public void handle(NodesListManagerEvent event) {
    RMNode eventNode = event.getNode();
    switch (event.getType()) {
    case NODE_UNUSABLE:
      LOG.debug(eventNode + " reported unusable");
      unusableRMNodesConcurrentSet.add(eventNode);
      for(RMApp app: rmContext.getRMApps().values()) {
        if (!app.isAppFinalStateStored()) {
          this.rmContext
              .getDispatcher()
              .getEventHandler()
              .handle(
                  new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                      RMAppNodeUpdateType.NODE_UNUSABLE));
        }
      }
{code}

That being said, we should still not add the node to the active list. The way to solve this
problem, though, is to get rid of UnknownNodeId and use anonymous classes to initialize these
dummy nodes.

For the unit test, I called {{allocate}} for this scenario, but that did not reproduce the
issue until I explicitly set updatedNodes to a list containing a NodeReport with an
UnknownNodeId.

Asking [~jlowe], [~templedf] for comments and corrections.

Excerpt from a sample test:
{code}
    // Issue an empty allocate call to obtain a response object.
    AllocateRequest allocateRequest =
        Records.newRecord(AllocateRequest.class);
    AllocateResponse resp = rmClient.allocate(allocateRequest);
    // Force an UnknownNodeId into updatedNodes; allocate alone did not produce one.
    NodeReport report = new NodeReportPBImpl();
    report.setNodeId(new NodesListManager.UnknownNodeId("host2"));
    List<NodeReport> reports = new ArrayList<NodeReport>();
    reports.add(report);
    resp.setUpdatedNodes(reports);
    allocateRequest =
        Records.newRecord(AllocateRequest.class);
    // Converting to the protobuf form is where the ClassCastException surfaces.
    YarnServiceProtos.AllocateResponseProto p = ((AllocateResponsePBImpl) resp).getProto();
{code}
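
The failure mode in the stack trace can be illustrated in isolation. The sketch below is only a model, not Hadoop code: {{Id}}, {{PbId}} and {{UnknownId}} are hypothetical stand-ins for NodeId, NodeIdPBImpl and UnknownNodeId, and {{convertToProtoFormat}} is assumed to downcast unconditionally, as the trace through NodeReportPBImpl.mergeLocalToBuilder suggests.

```java
// Stand-in types only; these are NOT the Hadoop classes.
public class CastSketch {
    /** Plays the role of the abstract record type (NodeId in the stack trace). */
    abstract static class Id {
        abstract String host();
    }

    /** Plays the role of the protobuf-backed implementation (NodeIdPBImpl). */
    static final class PbId extends Id {
        private final String host;
        PbId(String host) { this.host = host; }
        @Override String host() { return host; }
        String toProto() { return "proto(" + host + ")"; }
    }

    /** Plays the role of a hand-written subclass such as UnknownNodeId. */
    static final class UnknownId extends Id {
        private final String host;
        UnknownId(String host) { this.host = host; }
        @Override String host() { return host; }
    }

    /** Serialization step that assumes every Id is protobuf-backed. */
    static String convertToProtoFormat(Id id) {
        // Unconditional downcast: throws ClassCastException for any other subtype.
        return ((PbId) id).toProto();
    }

    /** Returns true if serialization fails with a ClassCastException. */
    static boolean failsToSerialize(Id id) {
        try {
            convertToProtoFormat(id);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsToSerialize(new PbId("host1")));      // false
        System.out.println(failsToSerialize(new UnknownId("host2"))); // true
    }
}
```

In this model, storing the hand-written subtype succeeds and the error only appears when the record is converted to its wire form, which would be consistent with the exception surfacing in {{getProto}} rather than in {{setNodeId}}.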

Proposed change in NodesListManager.java:
{code}
  private void setDecomissionedNMs() {
    Set<String> excludeList = hostsReader.getExcludedHosts();
    for (final String host : excludeList) {
      NodeId nodeId = makeUnknownNodeId(host);
      RMNodeImpl rmNode = new RMNodeImpl(nodeId,
          rmContext, host, -1, -1, makeUnknownNode(host), null, null);
      rmContext.getInactiveRMNodes().putIfAbsent(
          rmNode.getNodeID().getHost(), rmNode);
      rmNode.handle(new RMNodeEvent(rmNode.getNodeID(),
          RMNodeEventType.DECOMMISSION));
    }
  }
{code}

{code}
  Node makeUnknownNode(final String host) {
    return new Node() {
      @Override
      public String getNetworkLocation() {
        return null;
      }

      @Override
      public void setNetworkLocation(String location) {

      }

      @Override
      public String getName() {
        return host;
      }

      @Override
      public Node getParent() {
        return null;
      }

      @Override
      public void setParent(Node parent) {

      }

      @Override
      public int getLevel() {
        return 0;
      }

      @Override
      public void setLevel(int i) {

      }
    };
  }
{code}

> NodesListManager$UnknownNodeId ClassCastException
> -------------------------------------------------
>
>                 Key: YARN-4723
>                 URL: https://issues.apache.org/jira/browse/YARN-4723
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.3
>            Reporter: Jason Lowe
>            Assignee: Kuhu Shukla
>            Priority: Critical
>
> Saw the following in an RM log:
> {noformat}
> 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC Server handler 5 on 8030, call org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff
> java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl
>         at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247)
>         at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271)
>         at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647)
>         at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96)
>         at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>         at org.apache.hadoop.ipc.Server.call(Server.java:2267)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
