ambari-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Nettleton" <rnettle...@hortonworks.com>
Subject Re: Review Request 37356: Fix synchronization on TopologyManager's outstanding requests
Date Wed, 12 Aug 2015 21:54:55 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37356/
-----------------------------------------------------------

(Updated Aug. 12, 2015, 9:54 p.m.)


Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, Robert Levas, and
Sumit Mohanty.


Changes
-------

Updated patch, based on review comments provided by jspeidel

The locking ordering between the two methods in question is now handled the same way.  The
lock ordering in the first version of the patch is still honored.  Version 2 of the patch
extends the lock on "availableHosts" to the boundaries mentioned by jspeidel in his comments.
 

I've tested this with a 23-node cluster, and verified that my test Blueprint deployed successfully.


Bugs: AMBARI-12720
    https://issues.apache.org/jira/browse/AMBARI-12720


Repository: ambari


Description
-------

This patch provides a fix for AMBARI-12720. 

Blueprint Cluster deployments are failing under load, with clusters of 50+ nodes.  One or
more hosts in the cluster fail to be registered, even though the ambari-agent registration
occurs properly for these hosts.  This problem seems to occur when certain hosts are registered
at roughly the same time as the call to TopologyManager.processRequest().  

Upon further analysis, it appears that the TopologyManager.outstandingRequests object is not
fully sychronized.  

There are two paths involved in reading/writing to this data structure:

1. TopologyManager.processRequest() (occurs on the main request thread handling the cluster
creation request).  When all known hosts have been processed, but some requests are pending,
the LogicalRequest instance is added to TopologyManager.outstandingRequests.  Note:  The call
to add this request is not synchronized on the outstandingRequests object, but is only synchronized
on TopologyManager.availableHosts. 
2. TopologyManager.onHostRegistered() (occurs on a separate thread that handles ambari-agent
registrations for new hosts joining the cluster).  This method does synchronize on TopologyManager.outstandingRequests,
and attempts to iterate over these requests to determine if the newly registered host can
satisfy the request.  

The incorrect synchronization on TopologyManager.outstandingRequests appears to be the root
of the intermittent cluster deployment problems described in AMBARI-12720.  Since this datastructure
is read/updated by both threads, and since this object is not protected by the same locks,
there appears to be a race condition in this code.  

To fix this problem, this patch implements the following:

1. Updates the TopologyManager.onHostRegistered() method to lock on
        --> availableHosts
            --> outstandingRequests
   prior to reading TopologyManager.outstandingRequests
   
2. Updates the TopologyManager.processRequest() method to lock on
        --> availableHosts
            --> outstandingRequests
   prior to adding a LogicalRequest instance to this collection
   

The locking order (availableHosts, outstandingRequests) must be kept in place in order to
avoid any potential deadlocks from making a change like this.


Diffs (updated)
-----

  ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java 4f51d47


Diff: https://reviews.apache.org/r/37356/diff/


Testing
-------

1. Ran the ambari-server unit test suite ("mvn clean test").  All tests are passing.
2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this change does not break
existing deployments.
3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make sure this change
does not break deployments of larger clusters.


Thanks,

Robert Nettleton


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message