cassandra-user mailing list archives

From "Durity, Sean R" <>
Subject RE: [EXTERNAL] Re: Upgrade strategy for high number of nodes
Date Mon, 02 Dec 2019 15:22:09 GMT
All my upgrades are without downtime for the application. Yes, do the binary upgrade one node
at a time. Then run upgradesstables on as many nodes as your app load can handle (maybe you
can point the app to a different DC, while another DC is doing upgradesstables). Upgradesstables
doesn’t cause downtime – it just increases the IO load on the nodes executing the upgradesstables.
I try to get it done as quickly as possible, because I suspend streaming operations (repairs,
etc.) until the sstable rewrites are completed.
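The sequence above (binary upgrade first, then sstable rewrites with streaming work paused) can be sketched roughly as a script. The host list, the ssh wrapper, and the `-j` concurrency level are placeholders to adapt, not anything from Sean's actual setup:

```shell
#!/usr/bin/env bash
# Sketch: rewrite sstables host by host after the binary upgrade is done.
# Pause scheduled repairs/streaming externally before this step; resume after.
# HOSTS, ssh access, and the -j level are placeholders for your environment.
set -eu

HOSTS=${HOSTS:-"node1 node2 node3"}
DRY_RUN=${DRY_RUN:-1}   # default: only print the commands

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

upgrade_sstables() {
  # -j limits how many sstables one node rewrites concurrently
  run ssh "$1" nodetool upgradesstables -j 2
}

for h in $HOSTS; do
  upgrade_sstables "$h"
done
```

Setting `DRY_RUN=0` would execute the commands; the `-j` jobs flag on `nodetool upgradesstables` is one lever for keeping the extra IO load on each node tolerable while the app keeps serving traffic.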

Sean Durity

From: Shishir Kumar <>
Sent: Saturday, November 30, 2019 1:00 AM
Subject: [EXTERNAL] Re: Upgrade strategy for high number of nodes

Thanks for the pointer. We haven't changed the data model in a long time, so before applying workarounds (scrub)
it is worth understanding the root cause of the problem.
This might be the reason why running upgradesstables in parallel was not recommended.
On Sat, 30 Nov 2019, 10:37 Jeff Jirsa, <<>>
Scrub really shouldn’t be required here.

If there’s ever a step that reports corruption, it’s either a very, very old table where
you dropped columns previously or did something “wrong” in the past, or a software bug.
The old dropped column really should be obvious in the stack trace - anything else deserves
a bug report.

It’s unfortunate that people jump to just scrubbing the unreadable data - would appreciate
an anonymized JIRA if possible. Alternatively work with your vendor to make sure they don’t
have bugs in their readers somehow.

On Nov 29, 2019, at 8:58 PM, Shishir Kumar <<>>

Some more background. We have planned (and tested) a binary upgrade across all nodes without downtime,
with upgradesstables as the next step. The C* file format and version change (from format big, version
mc, to format bti, version aa; refer to the DSE 5.1 to 6.x upgrade documentation) explains why the
underlying rewrite takes so much time.

Running upgradesstables in parallel across RACs - this is where I am not sure about the impact of
running in parallel (the documentation recommends running one node at a time). During upgradesstables
there are scenarios where it reports file corruption, which then requires a corrective step, i.e. scrub.
Due to sstable corruption, nodes sometimes go down or end up at ~100% CPU usage. Performing the above
in parallel without downtime might result in more inconsistency across nodes. We have not tested this
scenario, so we would need the group's help if anyone has done a similar upgrade in the past (i.e. the
scenarios/complexity that need to be considered, and why the guideline recommends running
upgradesstables one node at a time).
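Since the rewrite described above goes from format `big`/`mc` to `bti`/`aa`, one rough way to watch upgradesstables progress is to count sstable files still carrying the old format in their names (the format appears in the filename, e.g. `mc-1-big-Data.db`). The data directory below is an assumption; DSE's default location may differ:

```shell
#!/usr/bin/env bash
# Rough progress check for upgradesstables: count sstables still in the old
# "big" format. New bti-format files have names like aa-1-bti-Data.db, so
# they will not match. DATA_DIR is an assumption for your install.
set -eu

count_old_format() {
  find "$1" -name '*-big-Data.db' | wc -l
}

count_old_format "${DATA_DIR:-.}"
```

When the count reaches zero on a node, that node's rewrite is done and streaming operations (repairs etc.) can be resumed for it.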

On Fri, Nov 29, 2019 at 11:52 PM Josh Snyder <<>>
Hello Shishir,

It shouldn't be necessary to take downtime to perform upgrades of a Cassandra cluster. It
sounds like the biggest issue you're facing is the upgradesstables step. upgradesstables is
not strictly necessary before a Cassandra node re-enters the cluster to serve traffic; in
my experience it is purely for optimizing the performance of the database once the software
upgrade is complete. I recommend trying out an upgrade in a test environment without using
upgradesstables, which should bring the 5 hours per node down to just a few minutes.
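A minimal sketch of the per-node rolling binary upgrade, deferring upgradesstables entirely as suggested above. The service name, package manager, and host naming are assumptions to adapt to your install:

```shell
#!/usr/bin/env bash
# Sketch of a rolling binary upgrade for a single node, with no
# upgradesstables step. Service/package names are assumptions; adjust
# for your distribution or DSE install.
set -eu

DRY_RUN=${DRY_RUN:-1}   # default: only print the commands

run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

upgrade_binary() {
  run ssh "$1" nodetool drain                     # flush memtables, stop accepting traffic
  run ssh "$1" sudo systemctl stop cassandra
  run ssh "$1" sudo apt-get install -y cassandra  # assumed package name/pinning
  run ssh "$1" sudo systemctl start cassandra
  run ssh "$1" nodetool status                    # confirm the node is UN before moving on
}

upgrade_binary "${HOST:-node1}"
```

Each node is only briefly out of the ring, which is what brings the 5 hours per node down to minutes.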

If you're running NetworkTopologyStrategy and you want to optimize further, you could consider
performing the upgrade on multiple nodes within the same rack in parallel. When correctly
configured, NetworkTopologyStrategy can protect your database from an outage of an entire
rack. So performing an upgrade on a few nodes at a time within a rack is the same as a partial
rack outage, from the database's perspective.
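One way to act on this is to group nodes by rack before scheduling the parallel work. A sketch, assuming the rack appears as the last column of `nodetool status` output (worth verifying against your version's column layout):

```shell
#!/usr/bin/env bash
# Group Up/Normal nodes by rack from `nodetool status` output, so one
# whole rack can be upgraded in parallel at a time. Column layout is an
# assumption: status first, address second, rack last.
set -eu

nodes_by_rack() {
  # stdin: nodetool status output; stdout: "<rack> <address>" per UN node
  awk '$1 == "UN" { print $NF, $2 }' | sort
}

# Example with captured output (replace with: nodetool status | nodes_by_rack)
sample='UN  10.0.0.1  210.5 GiB  256  ?  aaaa-1111  rack1
UN  10.0.0.2  198.2 GiB  256  ?  bbbb-2222  rack2
DN  10.0.0.3  205.0 GiB  256  ?  cccc-3333  rack1'

printf '%s\n' "$sample" | nodes_by_rack
```

As Josh notes, this is only safe when NetworkTopologyStrategy is correctly configured so that every replica set spans racks; otherwise taking a rack's worth of nodes down is not equivalent to a tolerated rack outage.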

Have a nice upgrade!


On Fri, Nov 29, 2019 at 7:22 AM Shishir Kumar <<>>

Need input on a Cassandra upgrade strategy for the following:
1. We have datacenters across 4 geographies (multiple isolated deployments in each DC).
2. The number of Cassandra nodes in each deployment is between 6 and 24.
3. The data volume on each node is between 150 and 400 GB.
4. All production environments have DR set up.
5. During the upgrade we do not want downtime.

We are planning a stack upgrade, but upgradesstables takes approx. 5 hours per node (when the
data volume is approx. 200 GB).
No downtime - Per the recommendation (DataStax documentation), upgrading one node at a time,
i.e. in sequence, means the upgrade cycle for one environment will take weeks, hence the DevOps concern.
Read only (no downtime) - Route read-only load to the DR system. We have resilience built in to
handle mutation scenarios, but if the upgrade takes more than, say, 3-4 hours, there will be a
long catch-up exercise. The maintenance cost seems too high due to the unknowns.
Downtime - We can upgrade all nodes in parallel while there are no live customers. This has direct
customer impact, so we need to weigh maintenance cost against customer impact.
Please suggest how other organisations (those with 100+ nodes) are solving this scenario.


