aurora-issues mailing list archives

From "David Robinson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AURORA-651) perform_maintenance_hosts should not temporarily remove machines
Date Wed, 13 Aug 2014 18:06:12 GMT
David Robinson created AURORA-651:
-------------------------------------

             Summary: perform_maintenance_hosts should not temporarily remove machines
                 Key: AURORA-651
                 URL: https://issues.apache.org/jira/browse/AURORA-651
             Project: Aurora
          Issue Type: Task
          Components: Client
            Reporter: David Robinson


The aurora_admin tool provides the following drain/maintenance commands:

- start_maintenance_hosts

    The list of hosts is marked for maintenance, and will be de-prioritized
    from consideration for scheduling.  Note, they are not removed from
    consideration, and may still schedule tasks if resources are very scarce.
    Usually you would mark a larger set of machines for drain, and then do
    them in batches within the larger set, to help drained tasks not land on
    future hosts that will be drained shortly in subsequent batches.

- host_maintenance_status

    Print the drain status of each supplied host.

- perform_maintenance_hosts

    Asks the scheduler to remove any running tasks from the machine and remove it
    from service temporarily, perform some action on them, then return the machines
    to service.

- end_maintenance_hosts

    The list of hosts is marked as not in a drained state anymore.  This will
    allow normal scheduling to resume on the given list of hosts.

The command that actually drains a machine is perform_maintenance_hosts; however, it only
drains a machine *temporarily*. As soon as the machine is drained it is placed back into
service, thereby allowing tasks to be scheduled on it again. This default behavior is wrong.
The expected workflow is to pass the --post_drain_script option with a script that shuts down
the slave, typically by SSHing in and stopping the mesos process. It is not obvious that
--post_drain_script must be used, along with such a script, to properly drain a machine, and
the admin tool does not provide any other command that could be used to drain a machine
*and leave it drained*.
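
For illustration, a post-drain script of the kind described above might look roughly like
the sketch below. It is not taken from Aurora. It assumes the admin tool invokes the script
with the drained hostname as its only argument, that the slave runs as a service named
"mesos-slave", and that the operator has passwordless SSH and sudo on the host. All three
are assumptions to verify locally, and build_stop_command and SERVICE_NAME are hypothetical
names.

```python
#!/usr/bin/env python
"""Hypothetical post-drain script: stop the Mesos slave on a drained host."""
import subprocess
import sys

# Assumption: the slave runs as a service with this name; adjust for your hosts.
SERVICE_NAME = "mesos-slave"


def build_stop_command(hostname, service=SERVICE_NAME):
    """Return the ssh argv that stops the slave service on the given host."""
    # Reject empty names and anything ssh would parse as an option.
    if not hostname or hostname.startswith("-"):
        raise ValueError("invalid hostname: %r" % (hostname,))
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", hostname,
            "sudo", "service", service, "stop"]


def main(hostname):
    """Stop the slave on hostname and return the exit status of ssh."""
    return subprocess.call(build_stop_command(hostname))


# Assumption: the drained hostname arrives as the only argument.
# With any other argument count the script does nothing.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

Because the slave stays stopped after the script returns, the host cannot accept new tasks
even once perform_maintenance_hosts has returned it to service, which is the workaround this
issue argues should not be necessary.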

The ideal solution is described in AURORA-43.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
