ambari-dev mailing list archives

From "Mahadev konar (JIRA)" <>
Subject [jira] [Commented] (AMBARI-4323) Add ability to an agent to clear the ActionQueue
Date Fri, 21 Feb 2014 23:17:19 GMT


Mahadev konar commented on AMBARI-4323:

The proposal looks good to me.

> Add ability to an agent to clear the ActionQueue
> ------------------------------------------------
>                 Key: AMBARI-4323
>                 URL:
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent, controller
>    Affects Versions: 1.5.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.5.0
> h2. Implementation proposal:
> 1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. A CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact command to cancel and an arbitrary text string giving the reason for the cancellation. So a CANCEL_COMMAND looks like:
> {code}
> {
>   target_task_id: "4-3"
>   reason: "Aborted by user via API"
> }
> {code}
> 2. On the server side, commands of this type are issued automatically when tasks are considered timed out. I'm going to do that here: org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. We will also implement (in a separate jira) the ability to cancel an arbitrary command via the server API. A new method addCancelCommandAction() at org/apache/ambari/server/controller/ will become the endpoint that forms a new CANCEL_COMMAND.
> 3. On the agent side, CANCEL_COMMANDs are executed inside right after arrival (they are not put into the ActionQueue). If the command named by the CANCEL_COMMAND is neither present in the ActionQueue nor IN_PROGRESS (that is, it has already completed), the CANCEL_COMMAND is silently ignored. After executing a CANCEL_COMMAND, the agent starts executing the next EXECUTION_COMMAND from the ActionQueue.
> 4. Also, the agent clears the entire ActionQueue on every registration (after being disconnected from the server, or when re-registration is requested). I'm going to add the appropriate logic to src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The motivation is to make recovery from a network/server failure faster and more reliable: the agent will have an empty ActionQueue and will be able to execute new EXECUTION_COMMANDs and STATUS_COMMANDs immediately after registration. Currently, after re-registration the agent is locked up and continues to execute stale EXECUTION_COMMANDs.
> 5. -In both cases described above (executing a single CANCEL_COMMAND or clearing the entire ActionQueue), EXECUTION_COMMANDs are considered transaction-like. I mean that EXECUTION_COMMANDs that are already IN_PROGRESS are never interrupted. Thus we decrease the chances of leaving the system in a misconfigured/unpredictable state.- When an appropriate CANCEL_COMMAND is received, the EXECUTION_COMMAND is cancelled even if it is IN_PROGRESS.
> 6. The agent forms command reports for cancelled commands just as it does for COMPLETE and FAILED commands. Command statuses for cancelled commands are set to FAILED. I did not find enough reason to add a new command report state CANCELED; feedback is welcome. The reason text (why the command was cancelled) is appended to the command stderr and to the command stdout.
> So, a cancelled command report looks like:
> {code}
> {
>   taskId: "4-3"
>   status : FAILED
>   stderr : ".... some text ... \n Command was aborted because of: Aborted by user via API"
>   stdout : ".... some text ... \n Command was aborted because of: Aborted by user via API"
>   exitcode: 999
> }
> {code}
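
The agent-side behaviour in items 3 to 6 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual Ambari agent code: the class `ActionQueueSketch` and all of its methods and attributes are hypothetical names, commands are modelled as plain dictionaries, and stopping the underlying process of an IN_PROGRESS command is not modelled at all. The `reset()` method stands for the queue clearing proposed for registration; whether cleared commands should also produce reports is not stated in the proposal, so the sketch emits none.

```python
from collections import OrderedDict

CANCELLED_EXIT_CODE = 999  # exit code shown in the example report above


class ActionQueueSketch:
    """Hypothetical stand-in for the agent's ActionQueue; not Ambari's class."""

    def __init__(self):
        self.pending = OrderedDict()  # task id -> command, in arrival order
        self.in_progress = {}         # task id -> command currently running
        self.reports = []             # command reports for the next heartbeat

    def put(self, command):
        """Queue an EXECUTION_COMMAND."""
        self.pending[command["taskId"]] = command

    def start_next(self):
        """Move the oldest pending command to IN_PROGRESS and return it."""
        if not self.pending:
            return None
        task_id, command = self.pending.popitem(last=False)
        self.in_progress[task_id] = command
        return command

    def cancel(self, cancel_command):
        """Handle a CANCEL_COMMAND right away; it is never queued itself."""
        task_id = cancel_command["target_task_id"]
        command = self.pending.pop(task_id, None)
        if command is None:
            # Item 5: a command that is IN_PROGRESS is cancelled as well.
            command = self.in_progress.pop(task_id, None)
        if command is None:
            return None  # unknown or already finished: silently ignored
        note = "\n Command was aborted because of: " + cancel_command["reason"]
        report = {
            "taskId": task_id,
            "status": "FAILED",  # no separate CANCELED state (item 6)
            "stderr": command.get("stderr", "") + note,
            "stdout": command.get("stdout", "") + note,
            "exitcode": CANCELLED_EXIT_CODE,
        }
        self.reports.append(report)
        return report

    def reset(self):
        """Drop all pending commands, as proposed for every registration."""
        self.pending.clear()


queue = ActionQueueSketch()
queue.put({"taskId": "4-3"})
report = queue.cancel(
    {"target_task_id": "4-3", "reason": "Aborted by user via API"}
)
```

Keeping pending and running commands in separate maps makes the "silently ignored" case a plain lookup miss in both, which matches the wording of item 3.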
> Also, I'm going to fix a bug at org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. There, we pass a stage timeout instead of a task timeout as the parameter for org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded. After the bugfix, the task timeout plus a small additional interval will be passed as the parameter value. The additional interval (10-30 seconds) is needed to avoid sending a CANCEL_COMMAND when it is not absolutely necessary (in most cases a task times out at the agent automatically, without any server action).
> This implementation should also solve another related jira, AMBARI-4324.
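
The timeout check described in the bugfix paragraph can be illustrated with a minimal Python sketch. The real check lives in the Java server code (ActionScheduler#timeOutActionNeeded); the function name `cancel_needed`, its parameters, and the 20-second grace value are assumptions made for illustration, with the value picked from the 10-30 second range mentioned in the proposal.

```python
CANCEL_GRACE_SECONDS = 20  # assumed value inside the proposed 10-30 s range


def cancel_needed(task_start, task_timeout, now, grace=CANCEL_GRACE_SECONDS):
    """Return True once a task has outlived its own timeout plus the grace.

    All arguments are in seconds. The per-task timeout is used here, not
    the stage timeout, which is the bug the proposal sets out to fix.
    """
    return now - task_start > task_timeout + grace
```

With a 600-second task timeout, `cancel_needed(0, 600, 610)` is still false, which gives the agent a chance to time the task out on its own before the server sends a CANCEL_COMMAND.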

This message was sent by Atlassian JIRA
