ambari-dev mailing list archives

From "Dmitry Lysnichenko (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMBARI-4323) Add ability to an agent to clear the ActionQueue
Date Wed, 19 Feb 2014 21:52:26 GMT

     [ https://issues.apache.org/jira/browse/AMBARI-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dmitry Lysnichenko updated AMBARI-4323:
---------------------------------------

    Description: 
h2. Implementation proposal:
1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. A CANCEL_COMMAND contains
the identifier (task_id + stage_id) of the exact command to cancel and an arbitrary text
string (the reason for the cancellation). So a CANCEL_COMMAND looks like:

{code}
{
  "target_task_id": "4-3",
  "reason": "Aborted by user via API"
}
{code}

2. On the server side, commands of this type are issued automatically when tasks are considered
timed out. I'm going to do that in org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
We will also implement (in a separate jira) the ability to cancel an arbitrary command via the server
API. A new method addCancelCommandAction() in org/apache/ambari/server/controller/AmbariCustomCommandExecutionHelper.java
will become the endpoint that forms a new CANCEL_COMMAND.

3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py right after arrival
(they are not put into the ActionQueue). If the command named by the CANCEL_COMMAND is neither present
in the ActionQueue nor IN_PROGRESS (that is, it has already completed), the
CANCEL_COMMAND is silently ignored. After executing a CANCEL_COMMAND, the agent starts executing
the next EXECUTION_COMMAND from the ActionQueue.

4. Also, the agent clears the entire ActionQueue on every registration (after a disconnect from the server
or when re-registration is requested). I'm going to add the appropriate logic to src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat.
The motivation is to make recovery from a network/server failure faster and more reliable
(the agent will have an empty ActionQueue and will be able to execute new EXECUTION_COMMANDs
and STATUS_COMMANDs immediately after registration). Currently, after re-registration the agent
is locked up and continues to execute stale EXECUTION_COMMANDs.

5. -In both cases described above (executing a single CANCEL_COMMAND or clearing the entire ActionQueue)
EXECUTION_COMMANDs are considered transaction-like: EXECUTION_COMMANDs that
are already IN_PROGRESS are never interrupted. This decreases the chances of leaving the system
in a misconfigured/unpredictable state.- When a matching CANCEL_COMMAND is received, the EXECUTION_COMMAND
is cancelled even if it is IN_PROGRESS.

6. The agent forms command reports for cancelled commands just as it does for COMPLETE
and FAILED commands. The status of a cancelled command is set to FAILED. I did not
find enough justification for adding a new command report state CANCELED; feedback is welcome.
The reason text (why the command was cancelled) is appended to both the command's stderr and its
stdout.

So a cancelled command report looks like:

{code}
{
  "taskId": "4-3",
  "status": "FAILED",
  "stderr": ".... some text ... \n Command was aborted because of: Aborted by user via API",
  "stdout": ".... some text ... \n Command was aborted because of: Aborted by user via API",
  "exitcode": 999
}
{code}
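
The agent-side behaviour proposed in items 3 to 6 above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the ambari-agent implementation: the class ActionQueueSketch, its methods, and the dict layout of queued commands are hypothetical. Only the message fields (target_task_id, reason), the FAILED status, exit code 999, and the abort text come from the description. Actually interrupting a running process is not modelled.

```python
from collections import deque

ABORT_EXIT_CODE = 999  # exit code shown in the example report above


class ActionQueueSketch:
    """Hypothetical stand-in for the agent's ActionQueue (illustration only)."""

    def __init__(self):
        self.pending = deque()   # queued EXECUTION_COMMANDs, as dicts
        self.in_progress = None  # command currently executing, if any
        self.reports = []        # command reports for the next heartbeat

    def put(self, command):
        self.pending.append(command)

    def start_next(self):
        """Move the next queued command to IN_PROGRESS (execution not modelled)."""
        self.in_progress = self.pending.popleft() if self.pending else None
        return self.in_progress

    def cancel(self, cancel_command):
        """Handle a CANCEL_COMMAND right after arrival; return True if it applied."""
        task_id = cancel_command["target_task_id"]
        reason = cancel_command["reason"]
        # Item 5: a command is cancelled even if it is IN_PROGRESS.
        if self.in_progress is not None and self.in_progress["taskId"] == task_id:
            cancelled, self.in_progress = self.in_progress, None
        else:
            cancelled = next((c for c in self.pending if c["taskId"] == task_id), None)
            if cancelled is None:
                return False  # item 3: unknown or completed task, silently ignored
            self.pending.remove(cancelled)
        self.reports.append(self._cancel_report(cancelled, reason))
        return True

    def clear(self):
        """Item 4: drop all queued commands, e.g. on (re-)registration."""
        self.pending.clear()

    @staticmethod
    def _cancel_report(command, reason):
        note = "\n Command was aborted because of: " + reason
        return {
            "taskId": command["taskId"],
            "status": "FAILED",  # item 6: no separate CANCELED state
            "stderr": command.get("stderr", "") + note,
            "stdout": command.get("stdout", "") + note,
            "exitcode": ABORT_EXIT_CODE,
        }
```

In this sketch, cancelling task "4-3" while it is in progress yields a FAILED report with exit code 999, a CANCEL_COMMAND for an unknown task returns False without producing a report, and clear() empties only the pending queue.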

Also, I'm going to fix a bug in org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
There, we pass the stage timeout instead of the task timeout as a parameter to org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded.
After the fix, the task timeout plus a small grace period will be passed as the parameter value. The
grace period (10-30 seconds) is needed to avoid sending a CANCEL_COMMAND when it is not strictly
necessary (in most cases the task times out on the agent by itself, without any server action).
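
The intended timeout check can be illustrated with a short Python sketch. The server code in question is Java and the real signature of timeOutActionNeeded is not shown in this message, so the function name cancel_needed, its parameters, and the 30-second default below are assumptions made for illustration. The small extra time from the paragraph above is called a grace period here.

```python
CANCEL_GRACE_SECONDS = 30  # assumed value within the 10-30 second range mentioned above


def cancel_needed(task_start_time, now, task_timeout, grace=CANCEL_GRACE_SECONDS):
    """Return True once the task has run longer than its own timeout plus the grace period.

    All arguments are in seconds. The task timeout is used, not the stage timeout.
    """
    return now - task_start_time > task_timeout + grace
```

With a 600-second task timeout, a task that has been running for 620 seconds is still left to time out on the agent by itself, and the server would send a CANCEL_COMMAND only after 630 seconds have passed.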

This implementation should also resolve the related jira AMBARI-4324.

  was:
Implementation proposal:
1. Add new command type CANCEL_COMMAND to agent-server protocol. CANCEL_COMMAND contains identifier
(task_id + stage_id) of an exact command for cancellation.
2. At the server side, commands of this type are issued when tasks are considered timed out.
I'm going to do that here: org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
3. At the agent side, CANCEL_COMMANDs are executed inside Controller.py right after arrival
(they are not put into ActionQueue). If command mentioned by the CANCEL_COMMAND is not present
in ActionQueue (it is already in progress or completed), CANCEL_COMMAND is silently ignored.
4. Also, agent clears entire action queue when it can not continue exchanging heartbeats with
the server (disconnect or registration requested). I'm going to add an appropriate logic to
src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The motivation is
to make recovery from network/server fail more reliable and fast (agent will have an empty
ActionQueue and can start executing new EXECUTION_COMMANDS and STATUS_COMMANDS right after
registration).
5. In both cases described above (executing a single CANCEL_COMMAND or clearing entire ActionQueue)
EXECUTION_COMMANDS are considered transactional-like. I mean that EXECUTION_COMMANDs that
are already IN_PROGRESS are never interrupted. Thus we decrease chances of leaving system
in misconfigured/unpredictable state.
Also, I'm going to fix a bug at org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
Here, we pass stage timeout instead of task timeout as a parameter of org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded.
After bugfix, task timeout + some small time will be passed as a parameter value. Additional
small time (10-30 seconds) is needed to avoid sending CANCEL_COMMAND when it is not absolutely necessary
(task will timeout at agent automatically in most cases).


> Add ability to an agent to clear the ActionQueue
> ------------------------------------------------
>
>                 Key: AMBARI-4323
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4323
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent, controller
>    Affects Versions: 1.5.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.5.0
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
