ambari-dev mailing list archives

From "Robert Nettleton" <rnettle...@hortonworks.com>
Subject Re: Review Request 40924: During Upgrade Topology Manager Causes Ambari To Be Unresponsive With Infinite Loop
Date Fri, 04 Dec 2015 17:41:41 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40924/#review109002
-----------------------------------------------------------

Ship it!


Looks ok to me. 

I do think we should ask Sumit Mohanty or Sid Wagle to review this as well, to make sure the
Cluster changes around the desired configuration API are correct. 

Thanks.

- Robert Nettleton


On Dec. 3, 2015, 9:03 p.m., Sebastian Toader wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40924/
> -----------------------------------------------------------
> 
> (Updated Dec. 3, 2015, 9:03 p.m.)
> 
> 
> Review request for Ambari, Oliver Szabo, Robert Nettleton, and Sandor Magyari.
> 
> 
> Bugs: AMBARI-14188
>     https://issues.apache.org/jira/browse/AMBARI-14188
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> 1. Increased the interval between Cluster configuration request retries from 100 ms to
> 1 sec to reduce the CPU load caused by persistent failures.
> 
> 
> 2. When Ambari is (re)started, it checks whether there are any persisted cluster
> configuration requests that were not completed and replays them. To decide whether it
> has to create a cluster configuration request, it looks at the latest version of the
> cluster configs. If no config type has tag=TOPOLOGY_RESOLVED, it creates a cluster
> configuration request.
> 
> When the cluster is provisioned using a Blueprint, each config type has two versions:
> one with tag=INITIAL and one with tag=TOPOLOGY_RESOLVED, the latter being the latest
> (active) version. Upgrading the cluster to a different HDP version then updates all
> config types, creating new versions with tag="version....". If Ambari is restarted at
> this stage, it looks at the active versions of the cluster configs. Since none of them
> has tag=TOPOLOGY_RESOLVED, it creates a cluster configuration request, and a cluster
> configuration task is scheduled to handle the request. The logic that executes the task
> and tries to update the config types throws an exception saying that a config type with
> tag=TOPOLOGY_RESOLVED already exists, because this check looks at all versions, not only
> the active one. As a result, the retry mechanism for Cluster configuration keeps
> retrying every 100 ms for 30 min, with the side effect of the Ambari server being
> unresponsive.
> 
> Changed the logic that determines whether a cluster configuration request has to be
> replayed so that it looks at all existing versions of the config types and verifies
> whether there is at least one that went through the INITIAL -> TOPOLOGY_RESOLVED
> transition.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/state/Cluster.java 2afba7e
>   ambari-server/src/main/java/org/apache/ambari/server/state/DesiredConfig.java 0635284
>   ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java 7ced845
>   ambari-server/src/main/java/org/apache/ambari/server/topology/AmbariContext.java 608e6ca
>   ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java 9b6c9ad
>   ambari-server/src/test/java/org/apache/ambari/server/state/DesiredConfigTest.java 93e3f07
>   ambari-server/src/test/java/org/apache/ambari/server/topology/AmbariContextTest.java 254d3a3
> 
> Diff: https://reviews.apache.org/r/40924/diff/
> 
> 
> Testing
> -------
> 
> Manual testing:
> 
> 1. Created HDP2.2 cluster with Blueprint
> 2. Upgraded cluster to HDP 2.3.2.0
> 3. Restarted Ambari Server
> 4. Verified that Ambari server no longer errors in a loop, which was what caused it to
> become unresponsive
> 
> Unit test results:
> 
> Results :
> 
> Tests run: 3518, Failures: 0, Errors: 0, Skipped: 28
> 
> 
> Thanks,
> 
> Sebastian Toader
> 
>
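
To make the replay decision in the quoted description concrete, here is a minimal sketch of the two checks. It is not the code from the diff: the class, the ConfigVersion type, the method names and the "version-after-upgrade" tag are all hypothetical stand-ins, and only the INITIAL and TOPOLOGY_RESOLVED tag names come from the description.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; none of these types are Ambari classes.
final class ReplayCheckSketch {
    static final String INITIAL = "INITIAL";
    static final String TOPOLOGY_RESOLVED = "TOPOLOGY_RESOLVED";

    /** One stored version of a config type (hypothetical stand-in). */
    static final class ConfigVersion {
        final String type;
        final String tag;
        final long version;

        ConfigVersion(String type, String tag, long version) {
            this.type = type;
            this.tag = tag;
            this.version = version;
        }
    }

    /** Check as described before the change: inspect only the latest version of each type. */
    static boolean replayNeededActiveOnly(List<ConfigVersion> all) {
        Map<String, ConfigVersion> latest = new HashMap<>();
        for (ConfigVersion c : all) {
            ConfigVersion seen = latest.get(c.type);
            if (seen == null || c.version > seen.version) {
                latest.put(c.type, c);
            }
        }
        for (ConfigVersion c : latest.values()) {
            if (TOPOLOGY_RESOLVED.equals(c.tag)) {
                return false;
            }
        }
        return true;
    }

    /** Check as described after the change: inspect every version of every type. */
    static boolean replayNeededAllVersions(List<ConfigVersion> all) {
        Map<String, Set<String>> tagsByType = new HashMap<>();
        for (ConfigVersion c : all) {
            tagsByType.computeIfAbsent(c.type, k -> new HashSet<>()).add(c.tag);
        }
        // No replay if at least one type already went INITIAL -> TOPOLOGY_RESOLVED.
        for (Set<String> tags : tagsByType.values()) {
            if (tags.contains(INITIAL) && tags.contains(TOPOLOGY_RESOLVED)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Blueprint-provisioned cluster after an upgrade; the third tag is a placeholder.
        List<ConfigVersion> upgraded = Arrays.asList(
                new ConfigVersion("hdfs-site", INITIAL, 1),
                new ConfigVersion("hdfs-site", TOPOLOGY_RESOLVED, 2),
                new ConfigVersion("hdfs-site", "version-after-upgrade", 3));
        System.out.println("active-only check replays: " + replayNeededActiveOnly(upgraded));
        System.out.println("all-versions check replays: " + replayNeededAllVersions(upgraded));
    }
}
```

With the upgraded history in main, the active-only check asks for a replay because the newest version carries the upgrade tag, while the all-versions check does not, which is the behaviour the description aims for.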


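The effect of the longer retry interval from item 1 of the description can be sketched the same way. This is a generic fixed-delay retry loop, not Ambari's retry mechanism; the class and method names are hypothetical, and the attempt counts assume the 30 min window mentioned in the description stays unchanged.

```java
import java.util.function.BooleanSupplier;

// Illustrative sketch only; not the retry code touched by the review.
final class FixedDelayRetrySketch {

    /** Upper bound on attempts in a window, ignoring the time each attempt takes. */
    static long maxAttempts(long windowMillis, long delayMillis) {
        return windowMillis / delayMillis;
    }

    /**
     * Runs the task until it reports success or the time window is used up,
     * sleeping a fixed delay between attempts. Returns true on success.
     */
    static boolean runWithRetry(BooleanSupplier task, long delayMillis, long windowMillis) {
        long deadline = System.nanoTime() + windowMillis * 1_000_000L;
        while (true) {
            if (task.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() + delayMillis * 1_000_000L > deadline) {
                return false; // the next attempt would fall outside the window
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long window = 30L * 60L * 1000L; // 30 min in milliseconds
        System.out.println("attempts at 100 ms: " + maxAttempts(window, 100));
        System.out.println("attempts at 1 sec:  " + maxAttempts(window, 1000));
    }
}
```

Under that assumption, a request that can never succeed is attempted at most 18000 times with a 100 ms delay and at most 1800 times with a 1 sec delay, a tenfold drop in the work done while the failure persists.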