cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joel Knighton (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
Date Mon, 20 Feb 2017 06:06:44 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Joel Knighton updated CASSANDRA-12653:
--------------------------------------
    Status: Awaiting Feedback  (was: Open)

A few questions/comments on the latest patches:
* On all versions, a space is missing in the conditional {{if(firstSynSendAt == 0)}}
* On 2.2, it looks like the patch adds the {{valuesEqual}} method from later versions. Was
this intentional? It looks unused.
* On all versions, using {{firstSynSendAt == 0}} to check if it has been initialized isn't
entirely safe. It's entirely legal (although admittedly rare) for {{System.nanoTime}} to return
0. If this happened, all acks would be rejected.
* Comparisons for two {{System.nanoTime}} values (such as in {{GossipDigestAckVerbHandler}})
should not use t1 < t2. Instead, one should check the difference (t1 - t2 < 0) because
numerical overflow could occur in the {{System.nanoTime}} long.
* In {{maybeFinishShadowRound}}/{{finishShadowRound}}, we should add the states to the {{endpointShadowStateMap}}
before setting {{inShadowRound}} to false. It looks like the current behavior admits a race
where {{doShadowRound}} could read {{inShadowRound == false}} and exit its loop and copy the
endpointShadowStateMap before it is filled by shadow round finish.
* I believe {{firstSynSendAt}} is accessed from multiple threads and needs to be volatile.

> In-flight shadow round requests
> -------------------------------
>
>                 Key: CASSANDRA-12653
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Stefan Podkowinski
>            Assignee: Stefan Podkowinski
>            Priority: Minor
>             Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
>         Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires to gather and check some host
IDs or tokens by doing a gossip "shadow round" once before joining the cluster. This is done
by sending a gossip SYN to all seeds until we receive a response with the cluster state, from
where we can move on in the bootstrap process. Receiving a response will call the shadow round
done and calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state again.
> The issue here is that at this point there might be other in-flight requests and it's
very likely that shadow round responses from other seeds will be received afterwards, while
the current state of the bootstrap process doesn't expect this to happen (e.g. gossiper may
or may not be enabled). 
> One side effect will be that MigrationTasks are spawned for each shadow round reply except
the first. Tasks might or might not execute based on whether at execution time {{Gossiper.resetEndpointStateMap}}
had been called, which effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}}
at start of the task. You'll see error log messages such as follows when this happend:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - InetAddress /xx.xx.xx.xx
is now UP
> ERROR [MigrationStage:1]    2016-09-08 08:36:39,255 FailureDetector.java:223 - unknown
endpoint /xx.xx.xx.xx
> {noformat}
> Although is isn't pretty, I currently don't see any serious harm from this, but it would
be good to get a second opinion (feel free to close as "wont fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message