cassandra-commits mailing list archives

From "Jeremiah Jordan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5692) Race condition in detecting version on a mixed 1.1/1.2 cluster
Date Tue, 15 Oct 2013 06:28:44 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794929#comment-13794929 ]

Jeremiah Jordan commented on CASSANDRA-5692:
--------------------------------------------

This does not seem to be fixed (or there is another race condition) as of 1.2.10.  I just saw
this happen during an upgrade of a 10-node cluster, 5 nodes in each DC.  Six nodes (3 in
each DC) were seeing 4 nodes (2 in each DC) as the wrong version.  This caused timeout failures
and describe cluster failures (only from the nodes seen as being on the wrong version).  Restarting
the "wrong version" nodes didn't fix anything.  We had to restart the 6 nodes to get them
to re-detect the version, and then things started working.

> Race condition in detecting version on a mixed 1.1/1.2 cluster
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-5692
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5692
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.9, 1.2.5
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Minor
>             Fix For: 1.2.7, 2.0 beta 1
>
>         Attachments: 5692-0005.patch, 5692-0006.patch
>
>
> On a mixed 1.1 / 1.2 cluster, starting 1.2 nodes sometimes triggers a race condition in
> version detection, where the 1.2 node wrongly detects version 6 for a 1.1 node.
> It works as follows:
> 1) The just-started 1.2 node quickly opens an OutboundTcpConnection toward a 1.1 node
> before receiving any messages from the latter.
> 2) Given that the version is correctly detected only when the first message is received, the
> version is momentarily set to 6.
> 3) This opens an OutboundTcpConnection from 1.2 to 1.1 at version 6, which gets stuck
> in the connect() method.
> Later, the version is corrected, but all outbound connections from 1.2 to 1.1 are
> stuck at this point.
> Evidence from 1.2 logs:
> TRACE 13:48:31,133 Assuming current protocol version for /127.0.0.2
> DEBUG 13:48:37,837 Setting version 5 for /127.0.0.2
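
The sequence described in the quoted report can be sketched as a small, single-threaded model. This is only an illustration of the ordering problem, not Cassandra's code: VersionRegistry, OutboundConnection, is_stale and simulate are hypothetical names, and the only values taken from the report are the versions 6 and 5 and the peer address from the log excerpt.

```python
# Hypothetical model of the version-detection race; these classes are
# illustrative stand-ins, not Cassandra's actual classes or methods.
CURRENT_VERSION = 6  # version a 1.2 node assumes for an unknown peer (per the report)
LEGACY_VERSION = 5   # version later set for the 1.1 peer (per the log excerpt)


class VersionRegistry:
    """Tracks the messaging version believed to be used by each peer."""

    def __init__(self):
        self._versions = {}

    def get(self, peer):
        # Unknown peer: fall back to the current version, which is the
        # "Assuming current protocol version" behaviour seen in the log.
        return self._versions.get(peer, CURRENT_VERSION)

    def set(self, peer, version):
        self._versions[peer] = version


class OutboundConnection:
    """Captures the peer's believed version once, at connect time."""

    def __init__(self, peer, registry):
        self.peer = peer
        self.version = registry.get(peer)

    def is_stale(self, registry):
        # True when the registry has since been corrected but this
        # connection still carries the version it was opened with.
        return self.version != registry.get(self.peer)


def simulate():
    registry = VersionRegistry()
    peer = "127.0.0.2"
    # Steps 1 and 2: connect before any message from the peer has arrived.
    conn = OutboundConnection(peer, registry)
    # Later: the first inbound message reveals the real version.
    registry.set(peer, LEGACY_VERSION)
    return conn.version, registry.get(peer), conn.is_stale(registry)
```

In this model, simulate() returns the version the connection was opened with, the corrected version, and whether the two differ. The point is that correcting the registry alone does not touch a connection that already captured the earlier guess.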



--
This message was sent by Atlassian JIRA
(v6.1#6144)
