cassandra-user mailing list archives

From Josh Snyder <j...@code406.com>
Subject Re: Upgrade strategy for high number of nodes
Date Fri, 29 Nov 2019 18:22:10 GMT
Hello Shishir,

It shouldn't be necessary to take downtime to perform upgrades of a
Cassandra cluster. It sounds like the biggest issue you're facing is the
upgradesstables step. upgradesstables is not strictly necessary before a
Cassandra node re-enters the cluster to serve traffic; in my experience it
is purely for optimizing the performance of the database once the software
upgrade is complete. I recommend trying out an upgrade in a test
environment without using upgradesstables, which should bring the 5 hours
per node down to just a few minutes.
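
As a rough back-of-the-envelope sketch of what that difference means for a
one-node-at-a-time upgrade: the 24 nodes and the 5 hours per node come from
this thread, while the 10 minutes per node is only an assumed stand-in for
"a few minutes", and the function name is made up for illustration.

```python
def rolling_upgrade_hours(node_count: int, minutes_per_node: float) -> float:
    """Total wall-clock hours when nodes are upgraded strictly one at a time."""
    return node_count * minutes_per_node / 60


# Largest deployment mentioned in the thread: 24 nodes.
NODES = 24

# Figure from the thread: roughly 5 hours per node with upgradesstables.
with_upgradesstables = rolling_upgrade_hours(NODES, 5 * 60)

# Assumption: 10 minutes per node when upgradesstables is deferred.
without_upgradesstables = rolling_upgrade_hours(NODES, 10)

print(f"with upgradesstables:    {with_upgradesstables:.0f} h")
print(f"without upgradesstables: {without_upgradesstables:.0f} h")
```

Measure the real per-node time in your test environment and substitute it
for the assumed 10 minutes.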

If you're running NetworkTopologyStrategy and you want to optimize further,
you could consider performing the upgrade on multiple nodes within the same
rack in parallel. When correctly configured, NetworkTopologyStrategy can
protect your database from an outage of an entire rack. So performing an
upgrade on a few nodes at a time within a rack is the same as a partial
rack outage, from the database's perspective.
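
One way to plan that, as a minimal sketch: group the nodes by rack and walk
through one rack at a time, a few nodes per batch, so that nodes from
different racks are never down together. The addresses, rack names and batch
size below are hypothetical; the rack of each node can be read from the Rack
column of `nodetool status`.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def rack_batches(
    node_racks: Dict[str, str], batch_size: int
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (rack, nodes) batches; a batch never spans more than one rack."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Group node addresses by the rack they belong to.
    by_rack: Dict[str, List[str]] = defaultdict(list)
    for node, rack in node_racks.items():
        by_rack[rack].append(node)

    # Finish one rack completely before moving on to the next.
    for rack in sorted(by_rack):
        nodes = sorted(by_rack[rack])
        for start in range(0, len(nodes), batch_size):
            yield rack, nodes[start:start + batch_size]


# Hypothetical cluster layout, written by hand for illustration.
cluster = {
    "10.0.1.1": "rack1", "10.0.1.2": "rack1", "10.0.1.3": "rack1",
    "10.0.2.1": "rack2", "10.0.2.2": "rack2", "10.0.2.3": "rack2",
}

for rack, nodes in rack_batches(cluster, batch_size=2):
    print(rack, nodes)
```

Each batch would be upgraded in parallel, and the next batch would start only
once every node in the previous one is back up and serving traffic.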

Have a nice upgrade!

Josh

On Fri, Nov 29, 2019 at 7:22 AM Shishir Kumar <shishirroy2000@gmail.com>
wrote:

> Hi,
>
> We need input on a Cassandra upgrade strategy for the following setup:
> 1. We have datacenters across 4 geographies (multiple isolated deployments in
> each DC).
> 2. Each deployment has between 6 and 24 Cassandra nodes.
> 3. Data volume on each node is between 150 and 400 GB.
> 4. All production environments have DR set up.
> 5. We do not want downtime during the upgrade.
>
> We are planning to go for a stack upgrade, but upgradesstables is taking
> approx. 5 hours per node (with a data volume of approx. 200 GB).
> Options:
> No downtime - As per the recommendation (DataStax documentation), if we
> upgrade one node at a time, i.e. in sequence, the upgrade cycle for one
> environment will take weeks, which is a DevOps concern.
> Read only (no downtime) - Route read-only load to the DR system. We have
> resilience built in to take care of mutation scenarios, but in case it takes
> more than, say, 3-4 hours, there will be a long catch-up exercise.
> Maintenance cost seems too high due to unknowns.
> Downtime - We can upgrade all nodes in parallel as there are no live
> customers. This has direct customer impact, so we need to make the case on
> maintenance cost vs. customer impact.
> Please suggest how other organisations (those with 100+ nodes) are solving
> this scenario.
>
> Regards
> Shishir
>
>
