cassandra-commits mailing list archives

From "Sergio Bossa (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5692) Race condition in detecting version on a mixed 1.1/1.2 cluster
Date Mon, 24 Jun 2013 12:53:22 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691941#comment-13691941 ]

Sergio Bossa commented on CASSANDRA-5692:
-----------------------------------------

Given that the first message, which should set up the version, is sent over the same connection,
this patch doesn't actually work: it causes two 1.2 nodes to block each other during bootstrap.

So I'm attaching a different patch, which implements a simple handshake: it assumes version
6 and tries to read the actual version on a different thread, so that the read can be interrupted
(disconnected) and the handshake retried until one of the following happens:
1) The version is confirmed to be >= 6, and the handshake succeeds.
2) The version is an old one, in which case it is expected to be found among the tracked
versions when the first gossip message is received.
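
As a rough illustration of the retry loop described above (not the attached patch), here is
a minimal Java sketch. Every name in it is hypothetical: HandshakeSketch, negotiate,
readPeerVersion (a stand-in for the blocking read of the peer's version), trackedVersions
(a stand-in for the versions learned from gossip), and the timeout and attempt limits.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HandshakeSketch {
    static final int CURRENT_VERSION = 6; // version assumed until proven otherwise
    static final int UNRESOLVED = -1;     // hypothetical marker: attempts exhausted

    /**
     * Hypothetical handshake loop. readPeerVersion stands in for the blocking
     * read of the peer's version; trackedVersions stands in for the versions
     * recorded when gossip messages arrive.
     */
    static int negotiate(String peer,
                         Callable<Integer> readPeerVersion,
                         Map<String, Integer> trackedVersions,
                         long timeoutMillis,
                         int maxAttempts) throws InterruptedException {
        ExecutorService reader = Executors.newSingleThreadExecutor();
        try {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                // Outcome 2: an old version has been tracked in the meantime.
                Integer tracked = trackedVersions.get(peer);
                if (tracked != null && tracked < CURRENT_VERSION) {
                    return tracked;
                }
                // Read on a separate thread so the read can be interrupted.
                Future<Integer> reply = reader.submit(readPeerVersion);
                try {
                    int version = reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    // Outcome 1: the peer confirmed a version >= 6.
                    if (version >= CURRENT_VERSION) {
                        return version;
                    }
                } catch (TimeoutException | ExecutionException e) {
                    // Interrupt the blocked read (the "disconnect") and retry.
                    reply.cancel(true);
                }
            }
            return UNRESOLVED;
        } finally {
            reader.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // A peer that answers immediately with version 6.
        Map<String, Integer> tracked = new ConcurrentHashMap<>();
        System.out.println(negotiate("/127.0.0.2", () -> 6, tracked, 1000, 3));
    }
}
```

The sketch relies on the read being interruptible; a real socket read would more likely be
unblocked by closing the connection, which is presumably what "disconnected" refers to above.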

Sorry for all the different patches, but the implementation details of all the version exchange
machinery turned out to be quite subtle.
                
> Race condition in detecting version on a mixed 1.1/1.2 cluster
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-5692
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5692
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.9, 1.2.5
>            Reporter: Sergio Bossa
>            Priority: Minor
>         Attachments: 5692-0001.patch, 5692-0004.patch, 5692-0005.patch
>
>
> On a mixed 1.1 / 1.2 cluster, starting a 1.2 node sometimes triggers a race condition in version detection, where the 1.2 node wrongly detects version 6 for a 1.1 node.
> It works as follows:
> 1) The just started 1.2 node quickly opens an OutboundTcpConnection toward a 1.1 node
before receiving any messages from the latter.
> 2) Given that the version is correctly detected only when the first message is received, the version is momentarily set to 6.
> 3) This opens an OutboundTcpConnection from 1.2 to 1.1 at version 6, which gets stuck in the connect() method.
> Later the version is corrected, but by this point all outbound connections from 1.2 to 1.1 are stuck.
> Evidence from 1.2 logs:
> TRACE 13:48:31,133 Assuming current protocol version for /127.0.0.2
> DEBUG 13:48:37,837 Setting version 5 for /127.0.0.2
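
The sequence in the quoted description can be mimicked with a small, deterministic Java
sketch. It is only a model of the ordering problem, not Cassandra's code: VersionRaceSketch,
versionFor, onFirstMessage, open and Connection are all hypothetical names, and the sketch
assumes a connection captures the version once, at the moment it is opened.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class VersionRaceSketch {
    static final int CURRENT_VERSION = 6; // what a 1.2 node assumes by default

    // Hypothetical per-peer version table, filled in when the first message arrives.
    private final Map<String, Integer> versions = new ConcurrentHashMap<>();

    // With no entry yet, fall back to the current version (the "assuming" step).
    int versionFor(String peer) {
        return versions.getOrDefault(peer, CURRENT_VERSION);
    }

    // Records the real version once the peer's first message has been received.
    void onFirstMessage(String peer, int version) {
        versions.put(peer, version);
    }

    /** Hypothetical connection that captures the version once, when opened. */
    static final class Connection {
        final int version;

        Connection(int version) {
            this.version = version;
        }
    }

    Connection open(String peer) {
        return new Connection(versionFor(peer));
    }

    public static void main(String[] args) {
        VersionRaceSketch node = new VersionRaceSketch();
        Connection early = node.open("/127.0.0.2"); // opened before any message
        node.onFirstMessage("/127.0.0.2", 5);       // version corrected later
        Connection late = node.open("/127.0.0.2");  // opened after the correction
        System.out.println(early.version + " " + late.version);
    }
}
```

In this model the early connection keeps the assumed version even after the table is
corrected, which corresponds to the stuck outbound connections described above.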

