cassandra-commits mailing list archives

From "Minh Do (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6619) Race condition issue during upgrading 1.1 to 1.2
Date Mon, 27 Jan 2014 22:11:37 GMT


Minh Do commented on CASSANDRA-6619:

As posted in other tickets, 1.1 and 1.2 use different message protocols.  It is therefore
important to set the correct target version when opening outbound connections, rather than
relying on inbound connections to set the version value.  Doing so eliminates the race
condition in setting the version values.

The attached patch makes sure the code does that when an outbound connection is opened
and the exchange of version information in the handshake fails.

As discussed with Jason Brown here at Netflix, our solution is that, during the upgrade,
the upgraded nodes have the variable cassandra.prev_version set in their environment to
5 (for 1.1.7) or 4 (for 1.1), to help the handshakes in a mixed-version cluster.

Once the cluster is fully upgraded to 1.2, cassandra.prev_version must be removed from every
node's environment, and a C* rolling restart across nodes is required.  This step ensures that
the patch won't penalize a pure 1.2 cluster, where all outbound connections are from 1.2 to 1.2.
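
To make the fallback order concrete, here is a minimal sketch of how an outbound connection
might pick its target version.  It is not the attached patch and does not use Cassandra's
actual classes: TargetVersionSketch, recordHandshake, parsePrevVersion and targetVersion are
hypothetical names, and the value 6 for the 1.2 messaging version is an assumption (only 4
and 5 are stated above).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; names and structure are illustrative, not Cassandra's code.
final class TargetVersionSketch {
    static final int VERSION_11 = 4;   // stated above: 1.1
    static final int VERSION_117 = 5;  // stated above: 1.1.7
    static final int VERSION_12 = 6;   // ASSUMPTION: messaging version used by 1.2

    // Versions learned from successful handshakes, keyed by peer address.
    private final Map<String, Integer> knownVersions = new ConcurrentHashMap<>();

    void recordHandshake(String peer, int version) {
        knownVersions.put(peer, version);
    }

    // Parses the configured previous version; returns null when unset or malformed.
    static Integer parsePrevVersion(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        try {
            return Integer.valueOf(raw.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Order of preference: version learned from a handshake, then the configured
    // previous version, then this node's own version. Never exceeds VERSION_12.
    int targetVersion(String peer, String prevVersionProperty) {
        Integer known = knownVersions.get(peer);
        if (known != null) {
            return Math.min(known, VERSION_12);
        }
        Integer prev = parsePrevVersion(prevVersionProperty);
        return prev != null ? Math.min(prev, VERSION_12) : VERSION_12;
    }
}
```

A caller could pass System.getProperty("cassandra.prev_version") as the second argument,
assuming the variable is supplied as a JVM system property.  Once the variable is removed,
peers with no recorded handshake get the node's own version, which matches the all-1.2 case
described above.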


> Race condition issue during upgrading 1.1 to 1.2
> ------------------------------------------------
>                 Key: CASSANDRA-6619
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Minh Do
>            Assignee: Minh Do
>            Priority: Minor
>             Fix For: 1.2.14
> There is a race condition when upgrading a C* 1.1.x cluster to C* 1.2.
> One issue is that an OutboundTCPConnection can't be established from a 1.2 node to some
> 1.1.x nodes.  Because of this, a live cluster suffers from high read latency during the
> upgrade and is unable to fulfill some write requests.  This is not a problem in a small
> cluster, but it is in a large cluster (100+ nodes), because the upgrade process takes
> from 10+ hours to 1+ day(s) to complete.
> We are aware of CASSANDRA-5692; however, the issue is not fully fixed.  We already have
> a patch for this and will attach it shortly for feedback.

