cloudstack-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sebastien Goasguen <>
Subject Re: Blameless post mortem
Date Mon, 28 Sep 2015 12:35:16 GMT
Folks let’s chill for a second here,

Let’s be pragmatic:


- Master got unstable with lots of issues related to the VPC
- Issues were fixed 
- Let’s go back to blockers, fix and release 4.6


- We have a postmortem from Remi.
- Let’s talk it out, first with the folks that will be in Dublin or in a separate thread.
- Then live with a planned on-line hangout of sorts
- I believe that for now our focus should be on 4.6


- We are trying a new process for 4.6. Which is to stabilize master and release from there
- This is a big change compared to developing on master which is what we used to do.
- We are embracing github PR
- PR are the place to ask for questions on code, ask for more tests etc..
- We know we need more tests and more docs, we have known that forever
- The VPC refactor started a long time back, maybe I am confused but it seems that any VPC
related add-on will have to work with this refactor.

So please, no name calling, this is against ASF policy, we have thick skins but let’s keep
it cordial and get our eyes back on the ball
ie. What’s the latest BVT result ? What are the blockers etc ?


> On Sep 28, 2015, at 2:19 PM, Bharat Kumar <> wrote:
> Dude,
> There was nothing friendly about the postmortem you did, It was only partial, we should
do a complete postmortem and then draw conclusions.
> I think the post-mortems like this are of no use if we do not do them completely.
> Regards,
> Bharat.
> On 28-Sep-2015, at 5:39 pm, Remi Bergsma <> wrote:
>> Dude, this is the final friendly email about his. All points have been made in previous
mails. This has nothing to do with ‘blameless’ and ‘learning’ anymore.
>> Read Seb’s mail. We will move on now.
>> Regards, Remi
>> On 28/09/15 13:54, "Bharat Kumar" <> wrote:
>>> Hi Remi,
>>> Whatever  ever we think  we have discovered are all well known best practices
while developing code in community. 
>>> I agree that tests need to be run on a new PR,  but i wonder why was this ignored
when merging the VR refactor code. Perhaps we will uncover some more issues if we investigate
this. I believe 
>>> one of the reasons for this is the complexity and incomplete nature of the vr
refactor change and failing to identify the areas which got effected. If we had a good documentation
i think we cloud have understood the areas that were getting 
>>> impacted early on, areas like the vpn ,  iptables, isolated networks rvr   etc
 and run the relevant tests. The documentation will also help us focus on these areas while
reviewing  and fixing subsequent issues. Currently no one knows the areas that got effected

>>> due to the vr refactor change, we are seeing issues all over the code.  I think
this is a bigger problem than what we have discussed so far.
>>> I think presently we should stop fixing all the vr refactoring  bugs until we
come up with a  proper document describing the VR refactoring  changes.
>>> I am not suggesting that we should revert the vr refactor code, I am willing
to work on this and fix the issues,  I am only asking if we can get some documentation.
>>> Regards,
>>> Bharat.
>>> On 28-Sep-2015, at 4:59 pm, Sebastien Goasguen <> wrote:
>>>>> On Sep 28, 2015, at 1:14 PM, Remi Bergsma <>
>>>>> Hi Bharat,
>>>>> There is only one way to prove a feature works: with tests. That’s
why I say actually _running_ the tests we have today on any new PR, is the most important
thing. Having no documentation is a problem, I agree, but it is not more important IMHO. If
we had the documentation, we still would have issues if nobody runs the tests and verifies
pull requests. Documentation that is perfect does not automatically lead to perfect implementation.
So we need tests to verify.
>>>>> If we don’t agree that is also fine. We need to do both anyway and
I think we do agree on that.
>>>> Also we need to move forward. We should have a live chat once 4.6 is out
to discuss all issues/problems and iron out the process.
>>>> But reverting the VR refactor is not going to happen. There was ample discussions
on the PR when it was submitted, there was time to review and raise concerns at that time.
It went through quite a few reviews, tests etc…Maybe the documentation is not good, but
the time to raise this concern I am afraid was six months ago. We can learn from it, but we
are not going to revert it, this would not go cleanly as David mentioned.
>>>> So let’s get back to blockers for 4.6, are there still VR related issues
with master ?
>>>>> Regards,
>>>>> Remi
>>>>> On 28/09/15 12:15, "Bharat Kumar" <> wrote:
>>>>>> Hi Remi,
>>>>>> i do not agree with “There is no bigger problem”  part of your
reply. so I had to repeat myself to make it more clear, Not because i am not aware of what
this thread is supposed to do.
>>>>>> Regards,
>>>>>> Bharat.
>>>>>> On 28-Sep-2015, at 2:51 pm, Remi Bergsma <>
>>>>>>> Hi Bharat,
>>>>>>> I understand your frustrations but we already agreed on this
so no need to repeat. This thread is supposed to list some improvements and learn from it.
Your point has been taken so let’s move on.
>>>>>>> We need documentation first, then do a change after which all
tests should pass. Even better is we write (missing) tests before changing stuff so you know
they pass before and after the fact. 
>>>>>>> When doing reviews, feel free to ask for design documentation
if you feel it is needed.
>>>>>>> Regards, Remi
>>>>>>> On 28/09/15 11:02, "Bharat Kumar" <>
>>>>>>>> Hi Remi,
>>>>>>>> I never intended to say that we should not run tests, but
even before tests we should have proper documentation. My concern was if a major change is
being introduced it should be properly documented. All the issues which we are trying to fix
are majorly due to VR refactor. If there was a proper documentation for this we could
>>>>>>>> have fixed this in a better way.  Even to add tests we need
to understand how a particular thing works and what data dose it expect. I think while fixing
the python based code changes this is where most of the people are facing issues. A proper
documentation will help in understanding these in a better way.
>>>>>>>> Thanks,
>>>>>>>> Bharat.
>>>>>>>> On 28-Sep-2015, at 1:57 pm, Remi Bergsma <>
>>>>>>>>> Hi Bharat,
>>>>>>>>> There is no bigger problem. We should always run the
tests and if we find a case that isn’t currently covered by the tests we should simply add
tests for it. There’s no way we’ll get a stable master without them. The fact that they
may not cover everything, is no reason to not rely on them. If a feature is not important
enough to write a test for, then the feature is probably not important anyway. And if it is,
then add a test :-)
>>>>>>>>> I do agree on the design documentation requirement for
any (major?) change. I found some design documentations on the subject you mention, but it
should have been more detailed. 
>>>>>>>>> Regards,
>>>>>>>>> Remi
>>>>>>>>> On 28/09/15 09:58, "Bharat Kumar" <>
>>>>>>>>>> Hi Remi,
>>>>>>>>>> Thank you for the Blame less postmortem. 
>>>>>>>>>> I think there is a bigger problem here than just
the review process and running tests. Even if we run the tests we cannot be sure that every
thing will work as intended. The tests will only give some level of confidence. The tests
may not cover all the use cases.
>>>>>>>>>> I think the problem here is that the way major changes
to the code base are dealt with. For example,  VR refactoring was done without discussing
the design implications and the amount of changes it would bring in. I could not find any
design document. The vr refactor changed a lot of code and the way VR used to work and in
my opinion it was incomplete-vpn, isolated networks, basic networks, iptable rules and rvr
in isolated case etc were not implemented. Most of us are still in the process of understanding
this. Even before reaching this state we had to spend a lot of time fixing issues mentioned
in the thread [Blocker/Critical] VR related Issues.  
>>>>>>>>>> When a change of this magnitude is being made, we
should call out all the changes and document them properly. This will help people to create
better fixes. Currently when we attempt to fix the isolated vr case it is effecting the vpc
and vice versa. for example pr 738 fixed it for vpc networks but broke it for isolated case.
I believe it is not too late to at least start documenting the changes now.
>>>>>>>>>> Thanks,
>>>>>>>>>> Bharat.
>>>>>>>>>> On 28-Sep-2015, at 10:52 am, Sanjeev N <>
>>>>>>>>>>> I have a concern here. Some of us are actively
involved in reviewing the
>>>>>>>>>>> PRs related to marvin tests(Enhancing existing
tests/Adding new tests). If
>>>>>>>>>>> we have to test a PR it requires an environment
to be created with actual
>>>>>>>>>>> resources and this is going to take lot of time.
Some of the tests can run
>>>>>>>>>>> on simulator but most of the tests require real
hardware to test. PR
>>>>>>>>>>> submitter is already testing and submitting the
test results along with the
>>>>>>>>>>> PR. So is it require to test these PRs by reviewers?
>>>>>>>>>>> On Sat, Sep 26, 2015 at 1:49 PM, sebgoa <>
>>>>>>>>>>>> Remi, thanks for the detailed post-mortem,
it's a good read and great
>>>>>>>>>>>> learning.
>>>>>>>>>>>> I hope everyone reads it.
>>>>>>>>>>>> The one thing to emphasize is that we now
have a very visible way to get
>>>>>>>>>>>> code into master, we have folks investing
time to provide review (great),
>>>>>>>>>>>> we need the submitters to make due diligence
and answer all comments in the
>>>>>>>>>>>> reviews.
>>>>>>>>>>>> In another project i work on, nothing can
be added to the code without
>>>>>>>>>>>> unit tests. I think we could go down the
route of asking for new
>>>>>>>>>>>> integration tests and unit tests for anything.
If not, the PR does not get
>>>>>>>>>>>> merged. But let's digest your post-mortem
and we can discuss after 4.6.0.
>>>>>>>>>>>> I see that you reverted one commit that was
not made by you, that's great.
>>>>>>>>>>>> Let's focus on the blockers now, everything
else can wait.
>>>>>>>>>>>> The big bonus of doing what we are doing
is that once 4.6.0 is out, we can
>>>>>>>>>>>> merge PRs again (assuming they are properly
rebased and tested) and we can
>>>>>>>>>>>> release 4.6.1 really quickly after.
>>>>>>>>>>>> -sebastien
>>>>>>>>>>>> On Sep 25, 2015, at 9:51 PM, Remi Bergsma
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>> This mail is intended to be blameless.
We need to learn something from
>>>>>>>>>>>> it. That's why I left out who exactly did
what because it’s not relevant.
>>>>>>>>>>>> There are multiple examples but it's about
the why. Let's learn from this
>>>>>>>>>>>> without blaming anyone.
>>>>>>>>>>>>> We know we need automated testing. We
have integration tests, but we are
>>>>>>>>>>>> unable to run all of them on any Pull Request
we receive. If we would have
>>>>>>>>>>>> that in place, it'd be much easier to spot
errors, regression and so on.
>>>>>>>>>>>> It'd also be more rewarding to write more
>>>>>>>>>>>>> Unfortunately we're not there yet. So,
we need to do something else
>>>>>>>>>>>> instead until we get there. If we do nothing,
we know we have many issues
>>>>>>>>>>>> because a master that breaks on a regular
basis is the most frustrating
>>>>>>>>>>>> things. We said we'd use Pull Requests with
at least two humans to review
>>>>>>>>>>>> and give their OK for a Pull Request. In
the form of LGTM: Looks Good To
>>>>>>>>>>>> Me. Ok, so the LGTMs are there because we
have no automated testing. Keep
>>>>>>>>>>>> that in mind. You are supposed to replace
automated testing until it's
>>>>>>>>>>>> there.
>>>>>>>>>>>>> Since we do this, master got a lot more
stable. But every now and then
>>>>>>>>>>>> we still have issues. Let's look at how we
do manual reviews. Again, this
>>>>>>>>>>>> is not to blame anyone. It's to open our
eyes and make us realise what
>>>>>>>>>>>> we're doing and what results we get out of
>>>>>>>>>>>>> Example Pull Request #784:
>>>>>>>>>>>>> Title: CLOUDSTACK-8799 fixed the default
>>>>>>>>>>>>> That's nice, it has a Jira id and a short
description (as it should be).
>>>>>>>>>>>>> The first person comes along and makes
a comment:
>>>>>>>>>>>>> "There was also an issue with VPC VRs"
... "Have you seen this issue?
>>>>>>>>>>>> Does your change affects the VPC VR (single/redundant)?"
>>>>>>>>>>>>> Actually a good question. Unfortunaly
there comes no answer. After a
>>>>>>>>>>>> reminder, it was promised to do tests against
VPC networks. Great!
>>>>>>>>>>>>> The Jenkins builds both succeed and also
Travis is green. But how much
>>>>>>>>>>>> value does this have? They have the impression
to do automated testing, and
>>>>>>>>>>>> although you could argue they do, it's far
from complete. If it breaks, you
>>>>>>>>>>>> know you have an issue. But it doesn’t
work the other way around.
>>>>>>>>>>>>> Back to our example PR. In the mean time,
another commit gets pushed to
>>>>>>>>>>>> it: "CLOUDSTACK-8799 fixed for vpc networks."
But if you look at the Jira
>>>>>>>>>>>> issue, you see it is about redundant virtual
routers. The non-VPC ones. So
>>>>>>>>>>>> this is vague at best. But a reviewer gives
a LGTM because the person could
>>>>>>>>>>>> create a VPC. That doesn't have anything
to do with the problem being fixed
>>>>>>>>>>>> in this PR nor with the comments made earlier.
But, at least the person
>>>>>>>>>>>> said what he did and we should all do that.
What nobody knew back then, was
>>>>>>>>>>>> that this broke the default route on VPCs.
>>>>>>>>>>>>> Then something strange happens: the two
commits from the PR end up on
>>>>>>>>>>>> master as direct commits. With just one LGTM
and no verification from the
>>>>>>>>>>>> person commenting about the linked issue.
This happened on Friday September
>>>>>>>>>>>> 11th.
>>>>>>>>>>>>> That day 21 commits came in, from 7 Pull
Request and unfortunately also
>>>>>>>>>>>> from some direct commits. We noticed the
direct commits and notified the
>>>>>>>>>>>> list (
As a lot
>>>>>>>>>>>> came in at the same time, it was decided
not to revert them. Looking back,
>>>>>>>>>>>> we should have done it.
>>>>>>>>>>>>> From this point on, VPCs were broken
as they wouldn't get a default
>>>>>>>>>>>> route. So, no public internet access from
VMs in VPC tiers, no VPNs
>>>>>>>>>>>> working, etc. This was mentioned to the list
on Thursday September 15th,
>>>>>>>>>>>> after some chats and debugging going on over
the weekend (
>>>>>>>>>>>>> Here we are, master is broken functionality
wise and new Pull Requests
>>>>>>>>>>>> come in to fix blockers. But we cannot ever
test their proper working,
>>>>>>>>>>>> because VPCs are broken in master and so
also in the PRs branched off of
>>>>>>>>>>>> it. With or without change in the PR.
>>>>>>>>>>>>> It starts to escalate as the days go
>>>>>>>>>>>>> I’ll leave out the bit on how this
frustrated people. Although it’s good
>>>>>>>>>>>> to know we do not want to be in this situation.
>>>>>>>>>>>>> Eventually Wilder and I spent an evening
and a day working on a branch
>>>>>>>>>>>> where we loaded 7 PRs on top of each other
(all VR related) only to find
>>>>>>>>>>>> the VPC is still broken. It allowed us to
zoom in and find the default
>>>>>>>>>>>> route was missing again. We said it worked
3 weeks before, because the same
>>>>>>>>>>>> tests that succeeded then, now were broken.
We had already fixed this in PR
>>>>>>>>>>>> #738 on August 25 so were sure about it.
>>>>>>>>>>>>> After some digging we could trace it
back to Pull Request #784. Imagine
>>>>>>>>>>>> the feeling seeing your own comment there
mentioning the previous issue on
>>>>>>>>>>>> the default gateways. Fair to say our human
review process clearly failed
>>>>>>>>>>>> here. Many many hours were spent on this
problem over the past two weeks.
>>>>>>>>>>>> Could we have prevented this from happening?
I think so, yes.
>>>>>>>>>>>>> This example clearly shows why:
>>>>>>>>>>>>> - we should use Pull Requests
>>>>>>>>>>>>> It made the change visible: Great!
>>>>>>>>>>>>> - we do reviews and ask for feedback
>>>>>>>>>>>>> We got feedback and questions: Also great!
>>>>>>>>>>>>> - we should always respond to feedback
and verify it is resolved, before
>>>>>>>>>>>> merging
>>>>>>>>>>>>> We need to improve here. Even with two
reviewers that say LGTM, we
>>>>>>>>>>>> should still address any feedback before
>>>>>>>>>>>>> - we should have two humans doing a review
>>>>>>>>>>>>> We need to improve here as well. Not
one reviewer, we need two. Really.
>>>>>>>>>>>>> - we need to document why we say LGTM.
>>>>>>>>>>>>> Another improvement. It’s nice to say
LGTM, but a review of only 4
>>>>>>>>>>>> characters and nothing more is useless. We
need to know what was tested and
>>>>>>>>>>>> how. Test results, screen shots or anything
that shows what's been
>>>>>>>>>>>> verified. If you only reviewed the code,
also fine but at least say that.
>>>>>>>>>>>> Then the next reviewer should do another
type of review to get the comlete
>>>>>>>>>>>> picture. Remember you're replacing automated
>>>>>>>>>>>>> - we should always merge Pull Requests
>>>>>>>>>>>>> We made it easy, merging is the de facto
standard, and it has even more
>>>>>>>>>>>> benefits. You can trace commits back to their
Pull Request (and find all
>>>>>>>>>>>> comments and discussion there: saves time,
trust me). It also allows for
>>>>>>>>>>>> easier reverting of a Pull Request. We’ll
see even more benefits once 4.7
>>>>>>>>>>>> is there. Although the intentions to merge
the Pull Request were there, it
>>>>>>>>>>>> still didn't happen. We should always check
before we push. As a committer
>>>>>>>>>>>> we just need to be sure.
>>>>>>>>>>>>> - we need automated testing!
>>>>>>>>>>>>> The sooner the better. It’s all about
the missing automated testing.
>>>>>>>>>>>> After 4.6, we all need to focus on this.
Saves a lot of time. And
>>>>>>>>>>>> frustrations.
>>>>>>>>>>>>> We're doing final testing on PR #887
and will merge it soon. From that
>>>>>>>>>>>> point on we can look into new issues. Be
aware that any PR out there that
>>>>>>>>>>>> was created after September 10 needs to be
rebased with current master
>>>>>>>>>>>> (when #887 is merged). Without that, no serious
testing can be done.
>>>>>>>>>>>>> Let's be careful what to land on master.
I'll only be merging Pull
>>>>>>>>>>>> Requests that have had proper reviews with
information on what was tested.
>>>>>>>>>>>> At least one reviewer needs to actually verify
it works (and show the rest
>>>>>>>>>>>> of us). We simply cannot assume it will work.
>>>>>>>>>>>>> If we do this, I think we can start resolving
the remaining blockers
>>>>>>>>>>>> one-by-one and go into the first RC round.
Please help out where you can so
>>>>>>>>>>>> we can make this a success together. Thanks!
>>>>>>>>>>>>> Looking forward to the day we have our
automated testing in place ;-)
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Remi

View raw message