ambari-dev mailing list archives

From "John Speidel" <jspei...@hortonworks.com>
Subject Re: Review Request 34677: Blueprint cluster provision occasionally fails due to out of order database writes
Date Tue, 26 May 2015 20:10:44 GMT


> On May 26, 2015, 7:50 p.m., Robert Levas wrote:
> > Ship It!

Forgot to add test results:
Results :

Tests run: 3011, Failures: 0, Errors: 0, Skipped: 21
...
----------------------------------------------------------------------
Total run:743
Total errors:0
Total failures:0


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34677/#review85239
-----------------------------------------------------------


On May 26, 2015, 7:42 p.m., John Speidel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34677/
> -----------------------------------------------------------
> 
> (Updated May 26, 2015, 7:42 p.m.)
> 
> 
> Review request for Ambari, Robert Nettleton and Tom Beerbower.
> 
> 
> Bugs: AMBARI-11394
>     https://issues.apache.org/jira/browse/AMBARI-11394
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Provisioning a cluster may occasionally fail to complete as a result of an out-of-order database write.
> This error presents itself as start task(s) that never progress beyond the PENDING state. For these logical pending tasks, there are no associated physical tasks.
> When a host is matched to a host request, an install request is submitted, followed immediately by a start request. The install task transitions the desired_state of all host components for the host from INIT to INSTALLED. But, because of an error in the persistence layer, after the desired_state is set to INSTALLED, it is overwritten with INIT on another thread (the heartbeat handler thread). As a result, the component is never started, because its desired state is INIT and it isn't processed by the start operation.
> The root cause is that the public method ServiceComponentHostImpl.handleEvent() is annotated with '@Transactional'. Inside this method the proper locks are acquired, BUT because the method is marked as @Transactional, its invocation is wrapped in a proxy which wraps the method invocation in a transaction. As a result, the transaction is committed in the proxy after the method returns, outside of any synchronization, which allows for out-of-order writes.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java dd06eb5
> 
> Diff: https://reviews.apache.org/r/34677/diff/
> 
> 
> Testing
> -------
> 
> - provisioned clusters via BP
> - currently re-running unit test suite and will update with results prior to merging
> 
> Because this is a timing issue which, according to a user, only occurs for them once every ~150 clusters, and I have been unable to reproduce it, I wasn't able to verify that this patch completely fixes this issue. But I can say with certainty that the issue that was fixed could manifest itself precisely as the bug describes.
> 
> 
> Thanks,
> 
> John Speidel
> 
>
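
For readers who want to see the ordering problem from the description in isolation, here is a minimal sketch. It is not the Ambari code and not necessarily what the patch changes: the class CommitOrderSketch, its methods, and the in-memory committedDesiredState field (standing in for the database row) are all hypothetical. The interleaving is forced on a single thread through a callback so the result is deterministic. The work of the "other thread" runs in the window between releasing the lock and committing.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names exist in Ambari.
public class CommitOrderSketch {
    private final ReentrantLock writeLock = new ReentrantLock();
    // Stand-in for the persisted desired_state column.
    private String committedDesiredState = "INIT";

    // Mimics a method whose transaction is committed by a wrapping proxy:
    // the body runs under the lock, the commit happens after the lock is released.
    public void commitOutsideLock(String desiredState, Runnable otherThreadWork) {
        String pendingWrite;
        writeLock.lock();
        try {
            pendingWrite = desiredState;      // change is only buffered here
        } finally {
            writeLock.unlock();
        }
        otherThreadWork.run();                // window: another writer can finish first
        committedDesiredState = pendingWrite; // "proxy" commits, unsynchronized
    }

    // One possible way to close the window: commit before releasing the lock.
    public void commitInsideLock(String desiredState) {
        writeLock.lock();
        try {
            committedDesiredState = desiredState;
        } finally {
            writeLock.unlock();
        }
    }

    // Heartbeat write (INIT) takes the lock first, but the install write
    // (INSTALLED) commits first, so the stale INIT lands last.
    public static String staleOverwrite() {
        CommitOrderSketch host = new CommitOrderSketch();
        host.commitOutsideLock("INIT",
                () -> host.commitOutsideLock("INSTALLED", () -> { }));
        return host.committedDesiredState;
    }

    // With the commit under the lock, commit order equals lock order.
    public static String orderedWrites() {
        CommitOrderSketch host = new CommitOrderSketch();
        host.commitInsideLock("INIT");
        host.commitInsideLock("INSTALLED");
        return host.committedDesiredState;
    }

    public static void main(String[] args) {
        System.out.println("commit outside lock: " + staleOverwrite());
        System.out.println("commit inside lock:  " + orderedWrites());
    }
}
```

In the first case the final value is INIT even though the INSTALLED write acquired the lock later, which matches the symptom described in the review: the component keeps desired_state INIT and the start operation skips it.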

