ambari-dev mailing list archives

From "Sebastian Toader" <stoa...@hortonworks.com>
Subject Review Request 40924: During Upgrade Topology Manager Causes Ambari To Be Unresponsive With Infinite Loop
Date Thu, 03 Dec 2015 21:03:46 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40924/
-----------------------------------------------------------

Review request for Ambari, Oliver Szabo, Robert Nettleton, and Sandor Magyari.


Bugs: AMBARI-14188
    https://issues.apache.org/jira/browse/AMBARI-14188


Repository: ambari


Description
-------

1. Increased the interval between cluster configuration request retries from 100 ms to 1 s in
order to reduce the CPU load caused by persistent failures.
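To put the interval change in numbers: the description below notes that failed requests were retried for 30 min. The following minimal Java sketch counts how many attempts a fixed-interval retry loop makes in that window. `RetryBudget` and `attemptsWithin` are hypothetical names, not Ambari code, and the sketch assumes each attempt itself takes negligible time.

```java
import java.time.Duration;

// Hypothetical helper, not Ambari code. Assumes each attempt takes negligible
// time, so the attempt count is just the retry window divided by the interval.
final class RetryBudget {

    private RetryBudget() {}

    /** Number of attempts a fixed-interval retry loop makes within the window. */
    static long attemptsWithin(Duration window, Duration interval) {
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return window.toMillis() / interval.toMillis();
    }

    public static void main(String[] args) {
        Duration window = Duration.ofMinutes(30);
        // Old interval: 100 ms between retries.
        System.out.println(attemptsWithin(window, Duration.ofMillis(100))); // 18000
        // New interval: 1 s between retries.
        System.out.println(attemptsWithin(window, Duration.ofSeconds(1)));  // 1800
    }
}
```

Under that assumption, going from 100 ms to 1 s cuts the attempts in a 30 min window from 18,000 to 1,800.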


2. When Ambari is (re)started, it checks whether there are any persisted cluster configuration
requests that were not completed and replays them. To decide whether it has to create a cluster
configuration request, it looks at the latest (active) version of the cluster configs. If no
config type has tag=TOPOLOGY_RESOLVED, then it creates a cluster configuration request.
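The pre-fix startup decision can be sketched as follows. This is a hypothetical model, not Ambari's API: `LegacyReplayCheck` and `needsReplay` are illustrative names, and the cluster's desired configs are reduced to a map from config type to its active tag.

```java
import java.util.Map;

// Hypothetical model of the pre-fix startup check, not Ambari's actual API.
// Only the active (desired) tag of each config type is inspected.
final class LegacyReplayCheck {

    static final String TOPOLOGY_RESOLVED = "TOPOLOGY_RESOLVED";

    private LegacyReplayCheck() {}

    /**
     * Returns true when a cluster configuration request would be replayed,
     * i.e. when no config type currently has TOPOLOGY_RESOLVED as its active tag.
     */
    static boolean needsReplay(Map<String, String> activeTagByConfigType) {
        return !activeTagByConfigType.containsValue(TOPOLOGY_RESOLVED);
    }

    public static void main(String[] args) {
        // Provisioned from a Blueprint: the active tag is TOPOLOGY_RESOLVED.
        System.out.println(needsReplay(Map.of("core-site", TOPOLOGY_RESOLVED))); // false
        // Provisioning interrupted before topology resolution: still INITIAL.
        System.out.println(needsReplay(Map.of("core-site", "INITIAL")));         // true
    }
}
```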


When the cluster is provisioned using a Blueprint, each config type has two versions: one with
tag=INITIAL and one with tag=TOPOLOGY_RESOLVED, the latter being the latest (active) version.
Upgrading the cluster to a different HDP version then updates all config types, creating
new versions with tag="version....". If Ambari is restarted at this stage, it looks at
the active versions of the cluster configs. Since none of them has tag=TOPOLOGY_RESOLVED,
it creates a cluster configuration request, and a cluster configuration task is scheduled
to handle it. The logic that executes the task and tries to update the config types then
throws an exception saying that a config type with tag=TOPOLOGY_RESOLVED already exists,
because this check looks at all versions, not only at the active one. As a result, the retry
mechanism for cluster configuration keeps retrying every 100 ms for 30 min, which has the
side effect of making the Ambari server unresponsive.
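The disagreement between the two checks can be reproduced with a small hypothetical model. Each config type maps to its version tags, oldest first, with the last one active. `ReplayMismatch`, its methods, and the tag value `version-upgrade` are placeholders invented for this sketch (the real post-upgrade tag is elided above); the code only illustrates why the startup check asks for a replay that the task then rejects.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model, not Ambari code. Each config type maps to its version
// tags, oldest first; the last tag is the active one. "version-upgrade" is a
// placeholder for the elided post-upgrade tag.
final class ReplayMismatch {

    static final String TOPOLOGY_RESOLVED = "TOPOLOGY_RESOLVED";

    private ReplayMismatch() {}

    static String activeTag(List<String> tags) {
        return tags.get(tags.size() - 1);
    }

    /** Startup check (pre-fix): inspects only the active tag of each config type. */
    static boolean startupWantsReplay(Map<String, List<String>> versions) {
        return versions.values().stream()
                .noneMatch(tags -> TOPOLOGY_RESOLVED.equals(activeTag(tags)));
    }

    /** Task-side guard: inspects every version, not only the active one. */
    static void ensureNoResolvedVersion(Map<String, List<String>> versions, String configType) {
        if (versions.get(configType).contains(TOPOLOGY_RESOLVED)) {
            throw new IllegalStateException(
                    configType + " already has a version tagged " + TOPOLOGY_RESOLVED);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> afterUpgrade = Map.of(
                "core-site", List.of("INITIAL", TOPOLOGY_RESOLVED, "version-upgrade"));

        // The startup check asks for a replay because the active tag changed...
        System.out.println(startupWantsReplay(afterUpgrade)); // true
        try {
            // ...but the task then fails; in the server each such failure was retried.
            ensureNoResolvedVersion(afterUpgrade, "core-site");
        } catch (IllegalStateException e) {
            System.out.println("task fails: " + e.getMessage());
        }
    }
}
```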

Changed the logic that determines whether a cluster configuration request has to be replayed:
it now looks at all existing versions of the config types and verifies whether at least one
of them went through the INITIAL -> TOPOLOGY_RESOLVED transition.
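A sketch of the revised decision, under a hypothetical model where each config type maps to its version tags, oldest first. `ReplayCheck` and its methods are illustrative names, not the code from the patch, and reading "went through the INITIAL -> TOPOLOGY_RESOLVED transition" as "has an INITIAL version later followed by a TOPOLOGY_RESOLVED version" is an assumption; the actual change may inspect the config history differently.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the revised check, not the code from the patch.
// Assumption: the INITIAL -> TOPOLOGY_RESOLVED transition means an INITIAL
// version that is later followed by a TOPOLOGY_RESOLVED version.
final class ReplayCheck {

    static final String INITIAL = "INITIAL";
    static final String TOPOLOGY_RESOLVED = "TOPOLOGY_RESOLVED";

    private ReplayCheck() {}

    /** True if this config type's history (oldest first) shows the transition. */
    static boolean wentThroughResolution(List<String> tags) {
        int initial = tags.indexOf(INITIAL);
        return initial >= 0
                && tags.subList(initial + 1, tags.size()).contains(TOPOLOGY_RESOLVED);
    }

    /** Replay only if no config type ever completed the transition. */
    static boolean needsReplay(Map<String, List<String>> versions) {
        return versions.values().stream().noneMatch(ReplayCheck::wentThroughResolution);
    }

    public static void main(String[] args) {
        // After an upgrade the active tag changed, but the history still shows the transition.
        System.out.println(needsReplay(Map.of(
                "core-site", List.of(INITIAL, TOPOLOGY_RESOLVED, "version-upgrade")))); // false
        // Provisioning never finished: only the INITIAL version exists.
        System.out.println(needsReplay(Map.of("core-site", List.of(INITIAL))));          // true
    }
}
```

With this check, an upgraded cluster whose active tags are no longer TOPOLOGY_RESOLVED is not replayed, because the full history still records the transition.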


Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/state/Cluster.java 2afba7e
  ambari-server/src/main/java/org/apache/ambari/server/state/DesiredConfig.java 0635284
  ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java 7ced845
  ambari-server/src/main/java/org/apache/ambari/server/topology/AmbariContext.java 608e6ca
  ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java 9b6c9ad
  ambari-server/src/test/java/org/apache/ambari/server/state/DesiredConfigTest.java 93e3f07
  ambari-server/src/test/java/org/apache/ambari/server/topology/AmbariContextTest.java 254d3a3


Diff: https://reviews.apache.org/r/40924/diff/


Testing
-------

Manual testing:

1. Created an HDP 2.2 cluster with a Blueprint
2. Upgraded the cluster to HDP 2.3.2.0
3. Restarted Ambari Server
4. Verified that the Ambari server no longer errors in a loop, which was what made it unresponsive

Unit test results:

Results :

Tests run: 3518, Failures: 0, Errors: 0, Skipped: 28


Thanks,

Sebastian Toader

