cloudstack-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Remi Bergsma <>
Subject Re: Blameless post mortem
Date Mon, 28 Sep 2015 11:49:36 GMT

There are two VR related issues left:
- CLOUDSTACK-8697: Assign VPC Internal LB rule to a VM fails
- CLOUDSTACK-8915: Cannot SSH into VMs deployed Redundant VPC routers

The first one has been tested today and seems still present. The second we discovered this
weekend while testing. It was broken by a recent PR and we’ll try to fix it and prove with
tests that everything still works. Wilder and I will focus on these issues.

As for the other blockers, I believe Rajani works on CLOUDSTACK-8808 and René has sent a

Regards, Remi

On 28/09/15 13:29, "Sebastien Goasguen" <> wrote:

>> On Sep 28, 2015, at 1:14 PM, Remi Bergsma <> wrote:
>> Hi Bharat,
>> There is only one way to prove a feature works: with tests. That’s why I say actually
_running_ the tests we have today on any new PR, is the most important thing. Having no documentation
is a problem, I agree, but it is not more important IMHO. If we had the documentation, we
still would have issues if nobody runs the tests and verifies pull requests. Documentation
that is perfect does not automatically lead to perfect implementation. So we need tests to
>> If we don’t agree that is also fine. We need to do both anyway and I think we do
agree on that.
>Also we need to move forward. We should have a live chat once 4.6 is out to discuss all
issues/problems and iron out the process.
>But reverting the VR refactor is not going to happen. There was ample discussions on the
PR when it was submitted, there was time to review and raise concerns at that time. It went
through quite a few reviews, tests etc…Maybe the documentation is not good, but the time
to raise this concern I am afraid was six months ago. We can learn from it, but we are not
going to revert it, this would not go cleanly as David mentioned.
>So let’s get back to blockers for 4.6, are there still VR related issues with master
>> Regards,
>> Remi
>> On 28/09/15 12:15, "Bharat Kumar" <> wrote:
>>> Hi Remi,
>>> i do not agree with “There is no bigger problem”  part of your reply. so
I had to repeat myself to make it more clear, Not because i am not aware of what this thread
is supposed to do.
>>> Regards,
>>> Bharat.
>>> On 28-Sep-2015, at 2:51 pm, Remi Bergsma <>
>>>> Hi Bharat,
>>>> I understand your frustrations but we already agreed on this so no need to
repeat. This thread is supposed to list some improvements and learn from it. Your point has
been taken so let’s move on.
>>>> We need documentation first, then do a change after which all tests should
pass. Even better is we write (missing) tests before changing stuff so you know they pass
before and after the fact. 
>>>> When doing reviews, feel free to ask for design documentation if you feel
it is needed.
>>>> Regards, Remi
>>>> On 28/09/15 11:02, "Bharat Kumar" <> wrote:
>>>>> Hi Remi,
>>>>> I never intended to say that we should not run tests, but even before
tests we should have proper documentation. My concern was if a major change is being introduced
it should be properly documented. All the issues which we are trying to fix are majorly due
to VR refactor. If there was a proper documentation for this we could
>>>>> have fixed this in a better way.  Even to add tests we need to understand
how a particular thing works and what data dose it expect. I think while fixing the python
based code changes this is where most of the people are facing issues. A proper documentation
will help in understanding these in a better way.
>>>>> Thanks,
>>>>> Bharat.
>>>>> On 28-Sep-2015, at 1:57 pm, Remi Bergsma <>
>>>>>> Hi Bharat,
>>>>>> There is no bigger problem. We should always run the tests and if
we find a case that isn’t currently covered by the tests we should simply add tests for
it. There’s no way we’ll get a stable master without them. The fact that they may not
cover everything, is no reason to not rely on them. If a feature is not important enough to
write a test for, then the feature is probably not important anyway. And if it is, then add
a test :-)
>>>>>> I do agree on the design documentation requirement for any (major?)
change. I found some design documentations on the subject you mention, but it should have
been more detailed. 
>>>>>> Regards,
>>>>>> Remi
>>>>>> On 28/09/15 09:58, "Bharat Kumar" <>
>>>>>>> Hi Remi,
>>>>>>> Thank you for the Blame less postmortem. 
>>>>>>> I think there is a bigger problem here than just the review process
and running tests. Even if we run the tests we cannot be sure that every thing will work as
intended. The tests will only give some level of confidence. The tests may not cover all the
use cases.
>>>>>>> I think the problem here is that the way major changes to the
code base are dealt with. For example,  VR refactoring was done without discussing the design
implications and the amount of changes it would bring in. I could not find any design document.
The vr refactor changed a lot of code and the way VR used to work and in my opinion it was
incomplete-vpn, isolated networks, basic networks, iptable rules and rvr in isolated case
etc were not implemented. Most of us are still in the process of understanding this. Even
before reaching this state we had to spend a lot of time fixing issues mentioned in the thread
[Blocker/Critical] VR related Issues.  
>>>>>>> When a change of this magnitude is being made, we should call
out all the changes and document them properly. This will help people to create better fixes.
Currently when we attempt to fix the isolated vr case it is effecting the vpc and vice versa.
for example pr 738 fixed it for vpc networks but broke it for isolated case. I believe it
is not too late to at least start documenting the changes now.
>>>>>>> Thanks,
>>>>>>> Bharat.
>>>>>>> On 28-Sep-2015, at 10:52 am, Sanjeev N <>
>>>>>>>> I have a concern here. Some of us are actively involved in
reviewing the
>>>>>>>> PRs related to marvin tests(Enhancing existing tests/Adding
new tests). If
>>>>>>>> we have to test a PR it requires an environment to be created
with actual
>>>>>>>> resources and this is going to take lot of time. Some of
the tests can run
>>>>>>>> on simulator but most of the tests require real hardware
to test. PR
>>>>>>>> submitter is already testing and submitting the test results
along with the
>>>>>>>> PR. So is it require to test these PRs by reviewers?
>>>>>>>> On Sat, Sep 26, 2015 at 1:49 PM, sebgoa <>
>>>>>>>>> Remi, thanks for the detailed post-mortem, it's a good
read and great
>>>>>>>>> learning.
>>>>>>>>> I hope everyone reads it.
>>>>>>>>> The one thing to emphasize is that we now have a very
visible way to get
>>>>>>>>> code into master, we have folks investing time to provide
review (great),
>>>>>>>>> we need the submitters to make due diligence and answer
all comments in the
>>>>>>>>> reviews.
>>>>>>>>> In another project i work on, nothing can be added to
the code without
>>>>>>>>> unit tests. I think we could go down the route of asking
for new
>>>>>>>>> integration tests and unit tests for anything. If not,
the PR does not get
>>>>>>>>> merged. But let's digest your post-mortem and we can
discuss after 4.6.0.
>>>>>>>>> I see that you reverted one commit that was not made
by you, that's great.
>>>>>>>>> Let's focus on the blockers now, everything else can
>>>>>>>>> The big bonus of doing what we are doing is that once
4.6.0 is out, we can
>>>>>>>>> merge PRs again (assuming they are properly rebased and
tested) and we can
>>>>>>>>> release 4.6.1 really quickly after.
>>>>>>>>> -sebastien
>>>>>>>>> On Sep 25, 2015, at 9:51 PM, Remi Bergsma <>
>>>>>>>>> wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>> This mail is intended to be blameless. We need to
learn something from
>>>>>>>>> it. That's why I left out who exactly did what because
it’s not relevant.
>>>>>>>>> There are multiple examples but it's about the why. Let's
learn from this
>>>>>>>>> without blaming anyone.
>>>>>>>>>> We know we need automated testing. We have integration
tests, but we are
>>>>>>>>> unable to run all of them on any Pull Request we receive.
If we would have
>>>>>>>>> that in place, it'd be much easier to spot errors, regression
and so on.
>>>>>>>>> It'd also be more rewarding to write more tests.
>>>>>>>>>> Unfortunately we're not there yet. So, we need to
do something else
>>>>>>>>> instead until we get there. If we do nothing, we know
we have many issues
>>>>>>>>> because a master that breaks on a regular basis is the
most frustrating
>>>>>>>>> things. We said we'd use Pull Requests with at least
two humans to review
>>>>>>>>> and give their OK for a Pull Request. In the form of
LGTM: Looks Good To
>>>>>>>>> Me. Ok, so the LGTMs are there because we have no automated
testing. Keep
>>>>>>>>> that in mind. You are supposed to replace automated testing
until it's
>>>>>>>>> there.
>>>>>>>>>> Since we do this, master got a lot more stable. But
every now and then
>>>>>>>>> we still have issues. Let's look at how we do manual
reviews. Again, this
>>>>>>>>> is not to blame anyone. It's to open our eyes and make
us realise what
>>>>>>>>> we're doing and what results we get out of that.
>>>>>>>>>> Example Pull Request #784:
>>>>>>>>>> Title: CLOUDSTACK-8799 fixed the default routes
>>>>>>>>>> That's nice, it has a Jira id and a short description
(as it should be).
>>>>>>>>>> The first person comes along and makes a comment:
>>>>>>>>>> "There was also an issue with VPC VRs" ... "Have
you seen this issue?
>>>>>>>>> Does your change affects the VPC VR (single/redundant)?"
>>>>>>>>>> Actually a good question. Unfortunaly there comes
no answer. After a
>>>>>>>>> reminder, it was promised to do tests against VPC networks.
>>>>>>>>>> The Jenkins builds both succeed and also Travis is
green. But how much
>>>>>>>>> value does this have? They have the impression to do
automated testing, and
>>>>>>>>> although you could argue they do, it's far from complete.
If it breaks, you
>>>>>>>>> know you have an issue. But it doesn’t work the other
way around.
>>>>>>>>>> Back to our example PR. In the mean time, another
commit gets pushed to
>>>>>>>>> it: "CLOUDSTACK-8799 fixed for vpc networks." But if
you look at the Jira
>>>>>>>>> issue, you see it is about redundant virtual routers.
The non-VPC ones. So
>>>>>>>>> this is vague at best. But a reviewer gives a LGTM because
the person could
>>>>>>>>> create a VPC. That doesn't have anything to do with the
problem being fixed
>>>>>>>>> in this PR nor with the comments made earlier. But, at
least the person
>>>>>>>>> said what he did and we should all do that. What nobody
knew back then, was
>>>>>>>>> that this broke the default route on VPCs.
>>>>>>>>>> Then something strange happens: the two commits from
the PR end up on
>>>>>>>>> master as direct commits. With just one LGTM and no verification
from the
>>>>>>>>> person commenting about the linked issue. This happened
on Friday September
>>>>>>>>> 11th.
>>>>>>>>>> That day 21 commits came in, from 7 Pull Request
and unfortunately also
>>>>>>>>> from some direct commits. We noticed the direct commits
and notified the
>>>>>>>>> list (
As a lot
>>>>>>>>> came in at the same time, it was decided not to revert
them. Looking back,
>>>>>>>>> we should have done it.
>>>>>>>>>> From this point on, VPCs were broken as they wouldn't
get a default
>>>>>>>>> route. So, no public internet access from VMs in VPC
tiers, no VPNs
>>>>>>>>> working, etc. This was mentioned to the list on Thursday
September 15th,
>>>>>>>>> after some chats and debugging going on over the weekend
>>>>>>>>>> Here we are, master is broken functionality wise
and new Pull Requests
>>>>>>>>> come in to fix blockers. But we cannot ever test their
proper working,
>>>>>>>>> because VPCs are broken in master and so also in the
PRs branched off of
>>>>>>>>> it. With or without change in the PR.
>>>>>>>>>> It starts to escalate as the days go by.
>>>>>>>>>> I’ll leave out the bit on how this frustrated people.
Although it’s good
>>>>>>>>> to know we do not want to be in this situation.
>>>>>>>>>> Eventually Wilder and I spent an evening and a day
working on a branch
>>>>>>>>> where we loaded 7 PRs on top of each other (all VR related)
only to find
>>>>>>>>> the VPC is still broken. It allowed us to zoom in and
find the default
>>>>>>>>> route was missing again. We said it worked 3 weeks before,
because the same
>>>>>>>>> tests that succeeded then, now were broken. We had already
fixed this in PR
>>>>>>>>> #738 on August 25 so were sure about it.
>>>>>>>>>> After some digging we could trace it back to Pull
Request #784. Imagine
>>>>>>>>> the feeling seeing your own comment there mentioning
the previous issue on
>>>>>>>>> the default gateways. Fair to say our human review process
clearly failed
>>>>>>>>> here. Many many hours were spent on this problem over
the past two weeks.
>>>>>>>>> Could we have prevented this from happening? I think
so, yes.
>>>>>>>>>> This example clearly shows why:
>>>>>>>>>> - we should use Pull Requests
>>>>>>>>>> It made the change visible: Great!
>>>>>>>>>> - we do reviews and ask for feedback
>>>>>>>>>> We got feedback and questions: Also great!
>>>>>>>>>> - we should always respond to feedback and verify
it is resolved, before
>>>>>>>>> merging
>>>>>>>>>> We need to improve here. Even with two reviewers
that say LGTM, we
>>>>>>>>> should still address any feedback before merging.
>>>>>>>>>> - we should have two humans doing a review
>>>>>>>>>> We need to improve here as well. Not one reviewer,
we need two. Really.
>>>>>>>>>> - we need to document why we say LGTM.
>>>>>>>>>> Another improvement. It’s nice to say LGTM, but
a review of only 4
>>>>>>>>> characters and nothing more is useless. We need to know
what was tested and
>>>>>>>>> how. Test results, screen shots or anything that shows
what's been
>>>>>>>>> verified. If you only reviewed the code, also fine but
at least say that.
>>>>>>>>> Then the next reviewer should do another type of review
to get the comlete
>>>>>>>>> picture. Remember you're replacing automated testing!
>>>>>>>>>> - we should always merge Pull Requests
>>>>>>>>>> We made it easy, merging is the de facto standard,
and it has even more
>>>>>>>>> benefits. You can trace commits back to their Pull Request
(and find all
>>>>>>>>> comments and discussion there: saves time, trust me).
It also allows for
>>>>>>>>> easier reverting of a Pull Request. We’ll see even
more benefits once 4.7
>>>>>>>>> is there. Although the intentions to merge the Pull Request
were there, it
>>>>>>>>> still didn't happen. We should always check before we
push. As a committer
>>>>>>>>> we just need to be sure.
>>>>>>>>>> - we need automated testing!
>>>>>>>>>> The sooner the better. It’s all about the missing
automated testing.
>>>>>>>>> After 4.6, we all need to focus on this. Saves a lot
of time. And
>>>>>>>>> frustrations.
>>>>>>>>>> We're doing final testing on PR #887 and will merge
it soon. From that
>>>>>>>>> point on we can look into new issues. Be aware that any
PR out there that
>>>>>>>>> was created after September 10 needs to be rebased with
current master
>>>>>>>>> (when #887 is merged). Without that, no serious testing
can be done.
>>>>>>>>>> Let's be careful what to land on master. I'll only
be merging Pull
>>>>>>>>> Requests that have had proper reviews with information
on what was tested.
>>>>>>>>> At least one reviewer needs to actually verify it works
(and show the rest
>>>>>>>>> of us). We simply cannot assume it will work.
>>>>>>>>>> If we do this, I think we can start resolving the
remaining blockers
>>>>>>>>> one-by-one and go into the first RC round. Please help
out where you can so
>>>>>>>>> we can make this a success together. Thanks!
>>>>>>>>>> Looking forward to the day we have our automated
testing in place ;-)
>>>>>>>>>> Regards,
>>>>>>>>>> Remi
View raw message