cloudstack-dev mailing list archives

From Remi Bergsma <RBerg...@schubergphilis.com>
Subject Re: [DISCUSS] VR upgrade downtime reduction
Date Tue, 06 Feb 2018 13:47:21 GMT
Hi Daan,

In my opinion the biggest issue is the fact that there are a lot of different code paths:
VPC versus non-VPC, VPC versus redundant-VPC, etc. That's why you cannot simply switch from
a single VPC to a redundant VPC for example. 

For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC with a single tier
and made sure all features are supported. Next we merged the single and redundant VPC code
paths. The idea here is that redundancy or not should only be a difference in the number of
routers. The code should be the same. A single router is also "master"; there just is no "backup".
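
As a rough illustration of "redundancy is only a difference in the number of routers" (this is not Cosmic's actual code; `RouterPlan`, `plan_routers`, and the priority values are hypothetical names and numbers chosen for the sketch), in Python:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RouterPlan:
    """Hypothetical description of one router to deploy for a VPC."""
    name: str
    priority: int  # assumed convention: the highest priority becomes master


def plan_routers(vpc_name: str, redundant: bool) -> List[RouterPlan]:
    """Return the routers for a VPC; redundancy only changes how many."""
    count = 2 if redundant else 1
    # Same code path either way: router 1 gets the highest priority,
    # each further router a lower one.
    return [
        RouterPlan(name=f"{vpc_name}-r{i + 1}", priority=100 - 10 * i)
        for i in range(count)
    ]


single = plan_routers("vpc1", redundant=False)
pair = plan_routers("vpc1", redundant=True)
print([r.name for r in single])
print([r.name for r in pair])
```

In this sketch the single router is planned exactly like the first router of a redundant pair, so going from single to redundant only adds an entry.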

That simplifies things A LOT, as keepalived is now the master of the whole thing. No more
assigning IP addresses in Python; that is left to keepalived instead. Lots of code deleted.
Easier to maintain, way more stable. We just released Cosmic 6 that has this feature and are
now rolling it out in production. Looking good so far. This change unlocks a lot of possibilities,
like live upgrading from a single VPC to a redundant one (and back). In the end, if the redundant
VPC is rock solid, you most likely don't even want single VPCs any more. But that will come.
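
To make "leave that to keepalived" a bit more concrete, here is a minimal sketch, assuming the router scripts only render a keepalived `vrrp_instance` block and let keepalived add and remove the addresses. The function name, instance name, interface names, and addresses are made up for illustration; Cosmic's real templates may look different.

```python
from typing import List


def render_vrrp_instance(interface: str, router_id: int,
                         priority: int, vips: List[str]) -> str:
    """Render one keepalived vrrp_instance block (illustrative values only)."""
    lines = [
        "vrrp_instance VPC {",
        # Every router starts as BACKUP; the election picks the master.
        # A lone router hears no advertisements and becomes master itself.
        "    state BACKUP",
        f"    interface {interface}",
        f"    virtual_router_id {router_id}",
        f"    priority {priority}",
        "    advert_int 1",
        # keepalived, not Python, brings these addresses up on the master.
        "    virtual_ipaddress {",
    ]
    lines += [f"        {vip}" for vip in vips]
    lines += ["    }", "}"]
    return "\n".join(lines) + "\n"


config = render_vrrp_instance("eth0", 51, 100, ["10.0.1.1/24 dev eth2"])
print(config)
```

The same rendered block would be used whether the VPC has one router or two; only the priority differs per router.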

As I said, we're rolling this out as we speak. In a few weeks when everything is upgraded
I can share what we learned and how well it works. CloudStack could use a similar approach.
 
Kind Regards,
Remi



On 05/02/2018, 16:44, "Daan Hoogland" <daan.hoogland@gmail.com> wrote:

    Hi devs,
    
    I have recently (re-)submitted two PRs, one by Wei [1] and one by Remi [2],
    that reduce downtime for redundant routers and redundant VPCs respectively.
    (please review those)
    Now we hear from customers that they also want to reduce downtime for
    regular VRs. Discussing this, we came up with two possible solutions,
    one of which we want to implement:
    
    1. start and configure a new router before destroying the old one and then
    as a last minute action stop the old one.
    2. make all routers start up redundancy services but for regular routers
    start only one until an upgrade is required at which time a new, second
    router can be started before killing the old one.
    
    Obviously both solutions have their merits, so I want your input
    to arrive at the most broadly supported implementation.
    -1 means there will be an overlap or a small delay and interruption of
    service.
    +1 It can be argued, "they got what they paid for".
    -2 means an overhead in memory usage on the router from the extra services
    running on it.
    +2 the number of router-varieties will be further reduced.
    
    -1&-2 We have to deal with potentially large upgrade steps, even from way
    before the CloudStack era, and might be stuck with option 1 because of
    that, needing to hack around it. Any dealing with older VRs, pre 4.5 and
    especially pre 4.0, will be hard.
    
    I am not cross posting though this might be one of these occasions where it
    is appropriate to include users@. Just my puristic inhibitions.
    
    Of course I have preferences but can you share your thoughts, please?
    
    And don't forget to review Wei's [1] and Remi's [2] work please.
    
    [1] https://github.com/apache/cloudstack/pull/2435
    [2] https://github.com/apache/cloudstack/pull/2436
    
    -- 
    Daan
    
