cloudstack-dev mailing list archives

From Rene Moser <m...@renemoser.net>
Subject [BLOCKER] CLOUDSTACK-8848
Date Sat, 26 Sep 2015 12:09:57 GMT
I discovered the race condition bug related to CLOUDSTACK-8848 while 
testing in our lab, and Daan started a PR, 
https://github.com/apache/cloudstack/pull/829, for discussion.

But the discussion turned out to be a dead end. Daan and I started a 
debug session on Friday a week ago and discovered the real problem, 
but it was unclear how it could be solved. Daan was off from the next day on.

Another discussion with @anshul1886, starting at 
https://github.com/apache/cloudstack/pull/829#issuecomment-141613687, 
led me to the solution I implemented in 
https://github.com/apache/cloudstack/pull/885.

The related comment from anshul:

 >From code it seems to be getting updated and DB also suggests that.
 >It will not be updated if there is no power change for 
 >MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT. But that is to reduce DB 
 >transactions and will not create issues as it is updated if there is 
 >change in power state.

This means all the logic for handling a missing power state is based on 
a DB record that is outdated due to this DB transaction optimization.
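To illustrate how the optimization leaves a stale record, here is a minimal model, not CloudStack's actual code. Only the name MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT comes from the comment above; its value (3), the VmPowerRecord class, report_power_state(), looks_missing() and the 10-second staleness threshold are hypothetical, chosen only for the example. It assumes a missing-VM check that looks at how old the record's update time is.

```python
from dataclasses import dataclass
from typing import Optional

# Name taken from the comment above; the value 3 is hypothetical.
MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT = 3


@dataclass
class VmPowerRecord:
    """Hypothetical stand-in for a VM's power-state columns in the DB."""
    power_state: Optional[str] = None
    update_time: Optional[int] = None
    update_count: int = 0


def report_power_state(record: VmPowerRecord, state: str, now: int) -> bool:
    """Apply one power report; return True if the DB record was written."""
    if state != record.power_state:
        # Power state changed: always written, counter starts over.
        record.power_state = state
        record.update_count = 1
    elif record.update_count < MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT:
        # Same state, but still below the limit: written.
        record.update_count += 1
    else:
        # Same state too often: skipped to save a DB transaction,
        # so update_time is no longer advanced.
        return False
    record.update_time = now
    return True


def looks_missing(record: VmPowerRecord, now: int, stale_after: int = 10) -> bool:
    """Hypothetical naive check: treat an old update_time as a missing VM."""
    return record.update_time is not None and now - record.update_time > stale_after


# The VM reports "PowerOn" every second from t=0 to t=20 ...
rec = VmPowerRecord()
for t in range(21):
    report_power_state(rec, "PowerOn", t)
# ... but the record was last written at t=2, so at t=20 it looks missing.
print(rec.update_time, looks_missing(rec, 20))
```

In this model the VM never stops reporting, yet the naive check flags it, which is the kind of outdated-record situation described above.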

My change makes sure that if we detect an outdated record, we reset the 
counter so that we get new state updates.

In the worst case (if the VM is really missing), the handling of missing 
state updates is postponed to the next missingStateReport. So to me, 
this is really a safe way to fix this issue.
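A rough sketch of that idea, again under stated assumptions and not the code in PR 885: when the missing-state check sees an outdated record, it resets the update counter and postpones the decision instead of handling the VM as missing. All names here (VmPowerRecord, report_power_state(), process_missing_state_report(), the missing_suspected flag, STALE_AFTER and the counter value) are hypothetical. The flag is this model's own way of making sure a VM that really is gone gets handled at the next check rather than being postponed forever.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration only.
MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT = 3
STALE_AFTER = 10  # seconds without a DB write before a record counts as outdated


@dataclass
class VmPowerRecord:
    power_state: Optional[str] = None
    update_time: Optional[int] = None
    update_count: int = 0
    missing_suspected: bool = False  # hypothetical: set when a check was postponed


def report_power_state(record: VmPowerRecord, state: str, now: int) -> bool:
    """Throttled write: identical states stop being persisted after N reports."""
    if state != record.power_state:
        record.power_state = state
        record.update_count = 1
    elif record.update_count < MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT:
        record.update_count += 1
    else:
        return False  # write skipped; update_time goes stale
    record.update_time = now
    record.missing_suspected = False
    return True


def process_missing_state_report(record: VmPowerRecord, now: int) -> bool:
    """Return True only if the VM should now be handled as missing."""
    if record.update_time is None or now - record.update_time <= STALE_AFTER:
        return False  # record is fresh enough
    if record.missing_suspected:
        # Still no DB write since the previous (postponed) check.
        return True
    # Outdated record: reset the counter so the next identical report is
    # written again, and postpone the decision to the next check.
    record.update_count = 0
    record.missing_suspected = True
    return False
```

Usage: a running VM whose record went stale is not treated as missing; its next report is persisted again. A VM that really is gone (no reports at all) is only handled at the following check.

```python
running = VmPowerRecord()
for t in range(5):
    report_power_state(running, "PowerOn", t)  # last DB write at t=2
process_missing_state_report(running, 20)      # False: counter reset, postponed
report_power_state(running, "PowerOn", 21)     # True: written again

gone = VmPowerRecord()
report_power_state(gone, "PowerOn", 0)
process_missing_state_report(gone, 20)         # False: postponed
process_missing_state_report(gone, 40)         # True: handled as missing
```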

I patched our lab environment, where we discovered the race condition in 
the first place, and we haven't seen the bug happen again.

You can find the logs attached to the PR: 
https://github.com/apache/cloudstack/pull/885

It isn't easy to test; I had to learn exactly when to start a VR migration 
to hit the race condition. That is why I am writing this message: to show 
you that I tested it under real-world conditions.

Yours
resmo

