ambari-issues mailing list archives

From "ramkrishna.s.vasudevan (Jira)" <j...@apache.org>
Subject [jira] [Created] (AMBARI-25620) Ambari Server STOP command on a node might fail because of a decommissioned NM's behaviour
Date Mon, 22 Feb 2021 17:18:00 GMT
ramkrishna.s.vasudevan created AMBARI-25620:
-----------------------------------------------

             Summary: Ambari Server STOP command on a node might fail because of a decommissioned NM's behaviour
                 Key: AMBARI-25620
                 URL: https://issues.apache.org/jira/browse/AMBARI-25620
             Project: Ambari
          Issue Type: Bug
            Reporter: ramkrishna.s.vasudevan


In our use case, before we STOP a set of components on a node, we check whether the node hosts
a NodeManager (NM) and a DataNode (DN). If it does, we first decommission them and then issue
STOP to all the components on the node.
* A decommissioned DataNode is not stopped: its process stays alive until we stop it explicitly.
The ResourceManager (RM), by contrast, stops the NM on decommission (this is known behaviour).
The RecoveryManager in Ambari then keeps restarting the NM, because its DESIRED state (on the
agent side) is still STARTED. As a result, the NM state keeps flipping between STARTED and
INSTALLED on the agent side, and each change is reported to the server as a component status
update.
* On receiving each update, the server sets the component STATE to STARTED or INSTALLED, as the
case may be.
* Now back to the actual STOP request. By design, once all the component updates have been sent,
the Ambari server processes them in a batch and attempts the in-memory STATE transitions on the
server side (in the FSM, i.e. the state machine, not in the cache). The server is processing an
INSTALL/STOP event for the NM and expects the component to be in the INSTALLED state, but because
of the restarts described above it finds STARTED instead. The server concludes that something is
wrong and aborts the entire STOP command.
* The server code that collects the failed events and the commands to abort:
{code:java}
        // Multimap is analogous to Map<Object, List<Object>> but avoids a nested loop
        ListMultimap<String, ServiceComponentHostEvent> eventMap = formEventMap(stage, commandsToStart);
        Map<ExecutionCommand, String> commandsToAbort = new HashMap<>();
        if (!eventMap.isEmpty()) {
          LOG.debug("==> processing {} serviceComponentHostEvents...", eventMap.size());
          Cluster cluster = clusters.getCluster(stage.getClusterName());
          if (cluster != null) {
            Map<ServiceComponentHostEvent, String> failedEvents = cluster.processServiceComponentHostEvents(eventMap);

            if (failedEvents.size() > 0) {
              LOG.error("==> {} events failed.", failedEvents.size());
            }

            for (Iterator<ExecutionCommand> iterator = commandsToUpdate.iterator(); iterator.hasNext(); ) {
              ExecutionCommand cmd = iterator.next();
              for (ServiceComponentHostEvent event : failedEvents.keySet()) {
                if (StringUtils.equals(event.getHostName(), cmd.getHostname()) &&
                  StringUtils.equals(event.getServiceComponentName(), cmd.getRole())) {
                  iterator.remove();
                  commandsToAbort.put(cmd, failedEvents.get(event));
                  break;
                }
              }
            }
{code}
* See processServiceComponentHostEvents() for how the transition is performed and where the
invalid transition occurs. The log message looks like this:

{code:java}
org.apache.ambari.server.state.fsm.InvalidStateTransitionException: Invalid event: HOST_SVCCOMP_INSTALL at STARTED
{code}
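
To make the failure mode concrete, here is a minimal, self-contained sketch of a state machine that rejects an event arriving in an unexpected state. It is not Ambari's FSM: the class `FsmSketch`, the two-state enum and the one-entry transition table are hypothetical, and only the event name and the message format are taken from the log line above.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Toy model only. The real Ambari state machine has many more states and events.
class FsmSketch {
    enum State { INSTALLED, STARTED }
    enum Event { HOST_SVCCOMP_INSTALL }

    // Hypothetical table: the states from which each event is accepted.
    static final Map<Event, Set<State>> ACCEPTED_FROM = new EnumMap<>(Event.class);
    static {
        ACCEPTED_FROM.put(Event.HOST_SVCCOMP_INSTALL, EnumSet.of(State.INSTALLED));
    }

    // Applies the event, or throws if the current state does not accept it.
    static State handle(State current, Event event) {
        Set<State> accepted = ACCEPTED_FROM.getOrDefault(event, EnumSet.noneOf(State.class));
        if (!accepted.contains(current)) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        return State.INSTALLED; // simplified target state for this toy table
    }

    public static void main(String[] args) {
        // The agent reported STARTED after a recovery restart, so the event is rejected.
        try {
            handle(State.STARTED, Event.HOST_SVCCOMP_INSTALL);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the agent-side restarts can flip the reported state at any moment, whether the event is accepted depends on timing, which is why the STOP command fails only some of the time.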
Since the entire set of STOP commands is treated as a FAILURE, the server issues an ABORT
command, and all the STOP commands issued to the agent are aborted.
As a result, the DN stays in the STARTED state, and the remaining DELETE HOST command keeps
failing.
The idea is that for an NM that has been decommissioned, if the current state is STARTED when
a HOST_SVCCOMP_INSTALL event is processed, this should not be treated as a failure condition,
so that the commands are not aborted.
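
The proposed idea could be sketched roughly as a filter applied to the failed events before any command is marked for abort. This is an illustration under stated assumptions, not the actual patch: `DecommissionToleranceSketch`, `FailedEvent` and its fields are hypothetical stand-ins, and how the server would learn that an NM is decommissioned is left open. The component name `NODEMANAGER` is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class DecommissionToleranceSketch {
    // Hypothetical, simplified view of one failed event. Not an Ambari class.
    static final class FailedEvent {
        final String component;
        final String eventType;
        final String currentState;
        final boolean decommissioned;

        FailedEvent(String component, String eventType, String currentState, boolean decommissioned) {
            this.component = component;
            this.eventType = eventType;
            this.currentState = currentState;
            this.decommissioned = decommissioned;
        }
    }

    // True for the one case the issue proposes to tolerate:
    // a decommissioned NM seen as STARTED while HOST_SVCCOMP_INSTALL is processed.
    static boolean isExpectedFlap(FailedEvent e) {
        return "NODEMANAGER".equals(e.component)
            && e.decommissioned
            && "HOST_SVCCOMP_INSTALL".equals(e.eventType)
            && "STARTED".equals(e.currentState);
    }

    // Keeps only the failures that should still lead to an abort.
    static List<FailedEvent> genuineFailures(List<FailedEvent> failed) {
        List<FailedEvent> result = new ArrayList<>();
        for (FailedEvent e : failed) {
            if (!isExpectedFlap(e)) {
                result.add(e);
            }
        }
        return result;
    }
}
```

Keeping the condition this narrow means every other invalid transition would still abort the command as it does today.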



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
