ambari-dev mailing list archives

From "Dmitry Sen (JIRA)" <>
Subject [jira] [Commented] (AMBARI-2713) Perf: Installer host registration stuck in 'Installing' for 10mins before succeeding
Date Wed, 24 Jul 2013 13:09:48 GMT


Dmitry Sen commented on AMBARI-2713:

> Perf: Installer host registration stuck in 'Installing' for 10mins before succeeding
> ------------------------------------------------------------------------------------
>                 Key: AMBARI-2713
>                 URL:
>             Project: Ambari
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 1.2.5
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.4.0
>         Attachments: AMBARI-2713.patch
> I deployed a 3-node cluster and then tried to bootstrap 4 nodes. Two of the nodes were 'bad', for different reasons:
> {code}
> host1.internal
> host2.internal
> host3.internal - python executables are left without x-bit
> host4.internal - non existent node, does not ping
> {code}
> I also enabled verbose logging with timestamps. During bootstrap, all 4 nodes were stuck in the Installing state for ~5 minutes. After that, 2 nodes were marked as failed and the other 2 switched to the Registering state.
> The root problem is that an scp operation to a non-existent node takes 1 minute before the connection times out. In addition:
> - all parallel scp operations are performed in up to 20 threads at once. If the list contains more hosts, it is split into chunks, and the next chunk starts only when the previous one has finished. The same applies to ssh.
> - the next operation starts only when all previous parallel ssh/scp operations have completed.
> - done files for hosts are created at the last step of bootstrap, for all hosts at once.
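
The chunked scheduling described above can be modelled with a few lines of Python. This is only an illustrative sketch, not the actual Ambari bootstrap code: the function names `chunked` and `phase_duration` are hypothetical, and the per-host durations are made-up inputs. It shows why a single slow host holds back every host in its chunk and delays all later chunks.

```python
from typing import List, Sequence


def chunked(items: Sequence[float], size: int) -> List[Sequence[float]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def phase_duration(host_seconds: Sequence[float], chunk_size: int = 20) -> float:
    """Wall-clock time of one parallel phase (e.g. scp) under the model above.

    Hosts inside a chunk run in parallel, so a chunk takes as long as its
    slowest host. The next chunk starts only after the previous one ends,
    so the chunk times add up.
    """
    return sum(max(chunk) for chunk in chunked(host_seconds, chunk_size))


# Hypothetical input: 40 hosts, each needing 5 s, except one dead host
# in the first chunk that only fails after a 60 s connection timeout.
durations = [5.0] * 19 + [60.0] + [5.0] * 20
total = phase_duration(durations)  # 60 s for chunk 1 + 5 s for chunk 2
```

In this model the 20 healthy hosts of the second chunk cannot even start until the dead host in the first chunk has timed out.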

> As a result, if there are 174 hosts overall and 26 of them are off, inaccessible, or not configured for pubkey auth:
> - the 174 hosts are split into 9 chunks of up to 20 hosts for the initial scp operation. Each chunk contains ~3 dead hosts, so for every chunk we have to wait ~1 minute for the dead hosts to time out, 9 minutes overall.
> - the 148 hosts that completed scp continue bootstrap and finish in a few minutes.
> - when all 148 hosts finish bootstrap, done files are created for all 174 hosts.
> - the server reads the exit statuses for all 174 hosts and considers bootstrap completed. This is reflected in the API.
> The described behaviour is not a bug but rather the way bootstrap currently works.
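
The arithmetic behind the 174-host example can be checked with a short calculation. This is a sketch using only the figures quoted in the description; the variable names are illustrative, and it assumes the dead hosts are spread so that every chunk contains at least one of them.

```python
import math

TOTAL_HOSTS = 174     # hosts passed to bootstrap
DEAD_HOSTS = 26       # off, inaccessible, or without pubkey auth
CHUNK_SIZE = 20       # maximum parallel scp threads
TIMEOUT_MINUTES = 1   # scp connection timeout for a dead host

# 174 hosts in chunks of up to 20 gives 9 chunks (the last one is partial).
chunks = math.ceil(TOTAL_HOSTS / CHUNK_SIZE)

# Average number of dead hosts per chunk, roughly 3.
dead_per_chunk = DEAD_HOSTS / chunks

# Hosts that get past the initial scp step.
live_hosts = TOTAL_HOSTS - DEAD_HOSTS

# Each chunk waits for its dead hosts to time out, and chunks run one
# after another, so the waits add up (assuming every chunk has a dead host).
scp_wait_minutes = chunks * TIMEOUT_MINUTES
```

Under these assumptions the initial scp step alone accounts for about 9 minutes, which is close to the 10 minutes in the issue title.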

> Possible solutions:
> - completely redesign
