ambari-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Hurley" <jhur...@hortonworks.com>
Subject Re: Review Request 37356: Fix synchronization on TopologyManager's outstanding requests
Date Tue, 11 Aug 2015 19:04:52 GMT


> On Aug. 11, 2015, 1:46 p.m., Jonathan Hurley wrote:
> > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java,
line 169
> > <https://reviews.apache.org/r/37356/diff/1/?file=1037718#file1037718line169>
> >
> >     Isn't this just asking for a deadlock? All it takes is a change in a different
method that causes the locks to be taken out of order.
> >     
> >     Why not just use a single lock to control access to this manager? The contention
would only be there during automated blueprint deployment, so it won't affect user experience
at all.
> 
> Robert Nettleton wrote:
>     Thanks for the review.  While using nested locks incorrectly can result in a deadlock,
I believe the ordering of the two locks in question is correct.
>     
>     That being said, it is certainly an issue that any future change in this class could
result in a deadlock.
>     
>     My goal with this patch was just to resolve a particular bug.  I am hestitant to
introduce a new locking mechanism into the TopologyManager as a bug fix.  
>     
>     It may be wiser in the future to move to a model where we synchronize access at the
level of the TopologyManager, but that is not currently the case, and would likely involve
a larger/riskier change than I'd like to make for this bug fix.  There are at least 4 separate
locks that are used in this code, sometimes in nested fashion, and I don't think this should
be re-written in a bug fix, but likely tackled in a future release.  
>     
>     If you'd like, I can file a JIRA to track an investigation on this issue in Ambari
2.2.  Sound ok?

+1 - Thanks Bob. I agree that in its current form, you have the logs in an order which should
not cause a deadlock.


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37356/#review94948
-----------------------------------------------------------


On Aug. 11, 2015, 11:30 a.m., Robert Nettleton wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37356/
> -----------------------------------------------------------
> 
> (Updated Aug. 11, 2015, 11:30 a.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, Robert Levas,
and Sumit Mohanty.
> 
> 
> Bugs: AMBARI-12720
>     https://issues.apache.org/jira/browse/AMBARI-12720
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> This patch provides a fix for AMBARI-12720. 
> 
> Blueprint Cluster deployments are failing under load, with clusters of 50+ nodes.  One
or more hosts in the cluster fail to be registered, even though the ambari-agent registration
occurs properly for these hosts.  This problem seems to occur when certain hosts are registered
at roughly the same time as the call to TopologyManager.processRequest().  
> 
> Upon further analysis, it appears that the TopologyManager.outstandingRequests object
is not fully sychronized.  
> 
> There are two paths involved in reading/writing to this data structure:
> 
> 1. TopologyManager.processRequest() (occurs on the main request thread handling the cluster
creation request).  When all known hosts have been processed, but some requests are pending,
the LogicalRequest instance is added to TopologyManager.outstandingRequests.  Note:  The call
to add this request is not synchronized on the outstandingRequests object, but is only synchronized
on TopologyManager.availableHosts. 
> 2. TopologyManager.onHostRegistered() (occurs on a separate thread that handles ambari-agent
registrations for new hosts joining the cluster).  This method does synchronize on TopologyManager.outstandingRequests,
and attempts to iterate over these requests to determine if the newly registered host can
satisfy the request.  
> 
> The incorrect synchronization on TopologyManager.outstandingRequests appears to be the
root of the intermittent cluster deployment problems described in AMBARI-12720.  Since this
datastructure is read/updated by both threads, and since this object is not protected by the
same locks, there appears to be a race condition in this code.  
> 
> To fix this problem, this patch implements the following:
> 
> 1. Updates the TopologyManager.onHostRegistered() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to reading TopologyManager.outstandingRequests
>    
> 2. Updates the TopologyManager.processRequest() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to adding a LogicalRequest instance to this collection
>    
> 
> The locking order (availableHosts, outstandingRequests) must be kept in place in order
to avoid any potential deadlocks from making a change like this.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java
8e7fb7f 
> 
> Diff: https://reviews.apache.org/r/37356/diff/
> 
> 
> Testing
> -------
> 
> 1. Ran the ambari-server unit test suite ("mvn clean test").  All tests are passing.
> 2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this change does not
break existing deployments.
> 3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make sure this
change does not break deployments of larger clusters.
> 
> 
> Thanks,
> 
> Robert Nettleton
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message