ambari-dev mailing list archives

From "Dmytro Sen (JIRA)" <>
Subject [jira] [Commented] (AMBARI-4992) Sometimes cluster installation pauses for few minutes between tasks
Date Thu, 13 Mar 2014 14:03:45 GMT


Dmytro Sen commented on AMBARI-4992:


> Sometimes cluster installation pauses for few minutes between tasks
> -------------------------------------------------------------------
>                 Key: AMBARI-4992
>                 URL:
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent
>    Affects Versions: 1.5.0
>            Reporter: Vitaliy Semenyk
>            Assignee: Dmitry Lysnichenko
> h2. The problem
> Primarily affects pluggable (python-based) services.
> During cluster installation, there may be a few significant pauses between task executions.
> During such a pause, the previous task shows up as completed in the UI, and the next task shows
> up as not started yet. This effect may be noticed 1-3 times when installing an entire cluster,
> with a single pause taking around 3 minutes in some cases.
> Initial analysis shows that this time is consumed by executing service checks that have
> been queued during cluster installation.
> h2. Some background:
> The server issues a big set of EXECUTION_COMMANDs at once, a few times during cluster installation.
> Typically, all commands for one set are sent to the agent at once. At the agent, status and
> execution commands are stored in the same queue. While the cluster is being installed, status
> commands are appended to the end of the queue. So when the last INSTALL command completes, we
> have a large number of status commands in the queue (hundreds?). Executing them may take around
> 3 minutes. START commands that have been issued by the server will not be scheduled for
> execution until all STATUS_COMMANDs in the queue have been performed. In the UI, it looks like
> the installation has hung.
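The blocking effect described above can be illustrated with a minimal sketch. It only models the ordering of a single FIFO queue; the function name `drain_order`, the command strings, and the count of 200 status commands are assumptions made for illustration and are not taken from the agent code.

```python
from collections import deque


def drain_order(install_cmds, status_cmds, start_cmds):
    """Return the order in which one shared FIFO queue would run commands."""
    queue = deque()
    queue.extend(install_cmds)  # the INSTALL set, sent by the server at once
    queue.extend(status_cmds)   # status commands appended while INSTALL runs
    queue.extend(start_cmds)    # the START set, issued after INSTALL completes
    order = []
    while queue:
        order.append(queue.popleft())
    return order


# Hypothetical workload: one INSTALL, 200 queued status commands, one START.
order = drain_order(
    ["INSTALL zookeeper"],
    ["STATUS #%d" % i for i in range(200)],
    ["START zookeeper"],
)
# START is only reached after every queued status command has been executed.
print(order.index("START zookeeper"))
```

With a shared queue, the wait before START grows linearly with the number of status commands that accumulated during INSTALL.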
> h2. Why it became noticeable at pluggable services:
> It's due to a few factors:
> - python services install faster
> - status commands run a bit slower because we invoke a separate subprocess to determine
> every status, and also perform more IO
> I've attached a relevant log. The interesting part is after the text:
> {code}
> INFO 2013-12-18 13:43:44,163 - Sending heartbeat with response id: 419 and timestamp: 1387374224161. Command(s) in progress: True. Components mapped: True
> {code}
> Zookeeper start has finished, and after that only status commands were executed for a few
> minutes (the START task for the next component just showed up in the UI as scheduled, but not
> started yet).
> h2. Selected solution
> I prefer the approach of checking whether the command queue is empty and then picking status
> commands from last_status. It is better because it can be done every 2 seconds, whereas status
> commands are sent by the server only every minute. I assume we still do not store duplicate
> commands in last_status.
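One way to read the selected approach is sketched below. This is a hedged illustration, not the agent's actual implementation: the class `CommandScheduler`, its method names, and the choice to key `last_status` by component name are all assumptions. Only the idea of serving status commands from `last_status` when the command queue is empty comes from the description.

```python
from collections import deque


class CommandScheduler:
    """Hypothetical sketch: execution commands always go before status checks."""

    def __init__(self):
        self.command_queue = deque()  # execution commands (INSTALL, START, ...)
        self.last_status = {}         # latest status command per component

    def put_execution(self, command):
        self.command_queue.append(command)

    def put_status(self, component, command):
        # Overwrite instead of appending, so no duplicates accumulate
        # (keying by component is an assumption of this sketch).
        self.last_status[component] = command

    def next_batch(self):
        """Pick the work for one polling cycle (e.g. every 2 seconds)."""
        if self.command_queue:
            return [self.command_queue.popleft()]
        # The queue is empty, so status commands can run without delaying anything.
        return list(self.last_status.values())


scheduler = CommandScheduler()
scheduler.put_status("ZOOKEEPER_SERVER", "STATUS zookeeper (old)")
scheduler.put_status("ZOOKEEPER_SERVER", "STATUS zookeeper (new)")
scheduler.put_execution("START zookeeper")

first = scheduler.next_batch()   # the pending START command goes first
second = scheduler.next_batch()  # queue is now empty, so status commands run
```

In this sketch a START command never waits behind status commands, and repeated status commands for the same component replace one another rather than piling up.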

This message was sent by Atlassian JIRA
