ambari-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Hurley" <jhur...@hortonworks.com>
Subject Review Request 30357: Deadlock Between Dependent Cluster/Service/Component/Host Implementations
Date Wed, 28 Jan 2015 17:38:24 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30357/
-----------------------------------------------------------

Review request for Ambari, Alejandro Fernandez, Mahadev Konar, Nate Cole, and Tom Beerbower.


Bugs: AMBARI-9368
    https://issues.apache.org/jira/browse/AMBARI-9368


Repository: ambari


Description
-------

A deadlock scenario was identified between ServiceComponentImpl and ServiceComponentHostImpl.
The root cause was the pattern of locking that is used in ClusterImpl, ServiceImpl, ServiceComponentImpl
and ServiceComponentHostImpl. Essentially, our pattern of using 2 locks per implementation
is, in most cases, unnecessary and deadlock prone. 

ClusterImpl:
I've gone through and removed the internal lock on the cluster. The cluster is the highest
level of logical organization and should only have 1 lock (hence the reason for a cluster
global lock). It made no sense to keep another internal lock that either locked in parallel
or contradicted the global lock. Furthermore, methods on primary keys such as clusterId don't
even need to lock since their values won't change.

ServiceImpl:
Locks around static data or primary key data such as service names, cluster IDs, etc, should
not have locks around them since they will never change. The clusterGlobalLock was removed
from many methods where the JPA entity of the service was being written to; a cluster lock
should not be used unless the state of the cluster is changing or being retrieved.

ServiceComponentImpl:
Locks around static data or primary key data such as service names, cluster IDs, etc, should
not have locks around them since they will never change. In many cases, the clusterGlobalLock
was removed for the same reasons as stated in ServiceImpl.

ServiceComponentHostImpl
Locks around static data or primary key data such as service names, cluster IDs, etc, should
not have locks around them since they will never change. In many cases, the clusterGlobalLock
was removed for the same reasons as stated in ServiceImpl.

Overall, the global cluster lock was left on methods that alter the topology of the cluster
(add/delete of a service/component/host) or when refreshing the internal entity. The cluster
global lock, however, once the cluster topology has changed. The internal locks were left
in place where there was work being done around a non-thread-safe member or when an internal
entity was having non-primary-key data set.


Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/state/ServiceComponentImpl.java a31e42c

  ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java 85e8314 
  ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java 0bfaa19

  ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java
dbdef2b 
  ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterDeadlockTest.java
PRE-CREATION 

Diff: https://reviews.apache.org/r/30357/diff/


Testing
-------

Installed a small (5 node) and large (20 node) cluster and performed an end-to-end install
with all services. In both cases, I had two browsers opened so there were interwoven requests
during installations. 

In the small cluster, I forced a yum repository error on install so that I would need to retry
the failed install. The cluster deleted and the services were repopulated on the retry.

After both clusters were installed, I execute an Add Service from the browser at the same
time as a DELETE service request from a 2nd browser. In both cases, the requests were scheduled
correctly and the services were propagated.

Testing also included adding new hosts, and adding new host components from 2 different browers
simultaneously. 

New tests were added to cover threads performing read serialization on the impl's concurrently
with impl creation and writes. These tests deadlocked before the changes.

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 29:21 min
[INFO] Finished at: 2015-01-28T01:36:06-05:00
[INFO] Final Memory: 36M/443M
[INFO] ------------------------------------------------------------------------


Thanks,

Jonathan Hurley


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message