ambari-dev mailing list archives

From "John Speidel" <jspei...@hortonworks.com>
Subject Re: Review Request 34821: Occasional database deadlock detected when provisioning cluster via blueprint api
Date Fri, 29 May 2015 20:06:27 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34821/
-----------------------------------------------------------

(Updated May 29, 2015, 8:06 p.m.)


Review request for Ambari, Robert Nettleton, Sumit Mohanty, Sid Wagle, and Tom Beerbower.


Bugs: AMBARI-11542
    https://issues.apache.org/jira/browse/AMBARI-11542


Repository: ambari


Description
-------

When provisioning a cluster via the blueprint api, occasionally a database deadlock is detected.
There is retry logic around this code so it doesn't affect the creation of the cluster and
a user wouldn't notice this unless they looked at the logs. That being said, this issue involves
incorrect transaction demarcation and synchronization and is potentially serious depending
on how it is manifested.
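
For readers unfamiliar with this kind of retry logic, the following is a minimal sketch of the
general pattern, not the actual Ambari code. The class, exception, and method names are all
hypothetical, and the deadlock is simulated rather than raised by a real database.

```java
import java.util.function.Supplier;

// Hypothetical sketch of retry-on-deadlock; not taken from the Ambari code base.
class DeadlockRetrySketch {

    // Stand-in for whatever exception the persistence layer raises on a deadlock.
    static class DeadlockDetectedException extends RuntimeException {
        DeadlockDetectedException(String message) {
            super(message);
        }
    }

    // Runs the operation, retrying up to maxAttempts times when a deadlock is reported.
    // The failure is only written to the log, which is why a user would not notice it.
    static <T> T retryOnDeadlock(Supplier<T> operation, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        DeadlockDetectedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (DeadlockDetectedException e) {
                last = e;
                System.err.println("Deadlock on attempt " + attempt + ": " + e.getMessage());
            }
        }
        throw last; // every attempt deadlocked
    }
}
```

A retry like this hides the symptom but leaves the underlying transaction and locking problem
in place.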

The fix involves changing the scope of the database transaction as well as the synchronization.
There are currently many transaction/synchronization issues in the state layer that
need to be addressed; this patch only deals with this exact use case.
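
To illustrate the general idea of aligning transaction scope with synchronization, here is a
minimal sketch. It is an assumption about the shape of such a fix, not the code in the diff:
all names are hypothetical, and the "transaction" is only an in-memory event log. The point
it shows is ordering: take the in-process lock first, begin and commit the transaction while
holding it, and release the lock last, so that no thread waits on the lock while it already
holds database locks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; none of these names come from the Ambari code base.
class TxScopeSketch {
    // Stand-in for a transaction manager: it only records begin/commit events.
    static final List<String> events = new ArrayList<>();
    static final ReentrantLock clusterLock = new ReentrantLock();
    static int hostComponentCount = 0;

    // Lock, then begin the transaction, then commit, then unlock.
    // The whole transaction is contained inside the critical section.
    static void createHostComponent() {
        clusterLock.lock();
        try {
            events.add("begin");
            hostComponentCount++; // stand-in for the persisted state change
            events.add("commit");
        } finally {
            clusterLock.unlock();
        }
    }

    // Calls createHostComponent() from several threads and waits for all of them.
    static void runConcurrently(int threads) {
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(TxScopeSketch::createHostComponent);
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        runConcurrently(8);
        System.out.println(hostComponentCount + " updates, events: " + events);
    }
}
```

Because every transaction starts and ends inside the lock, the recorded events never
interleave. The cost is that the work is fully serialized, which is the kind of trade-off
between correctness and performance that this review mentions.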

Also, this patch deals strictly with correctness, and I didn't make an effort to optimize this
path.  If this results in a performance regression, there are several approaches that we could
take.


Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/controller/AmbariManagementControllerImpl.java
792b6fe 

Diff: https://reviews.apache.org/r/34821/diff/


Testing (updated)
-------

Provisioned clusters many times via the blueprint api, looking for a reported database deadlock.
Without this patch, I was able to reproduce the deadlock fairly consistently; with the patch,
no deadlock occurred across many installs.

Unit Tests:
- tx/synchronization change only so no new unit test
- currently running full unit test suite and will update review with result summary when completed


Results :

Tests run: 3020, Failures: 0, Errors: 0, Skipped: 21
...
Total run:744
Total errors:0
Total failures:0


Thanks,

John Speidel

