cassandra-commits mailing list archives

From "Stefan Podkowinski (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-12653) In-flight shadow round requests
Date Tue, 14 Feb 2017 10:02:41 GMT


Stefan Podkowinski commented on CASSANDRA-12653:

bq. Stefan Podkowinski - is there some deeper purpose of moving the FD.instance.isAlive()
check higher in MigrationTask#runMayThrow() method beyond "check to see if it's dead before
we bother checking to see if it's worth sending a migration task"? Is there a reason we don't
let MM#shouldPullSchemaFrom return false if FD says the instance is dead?

We could move FD.isAlive into MM.shouldPullSchemaFrom, yes. I'm not totally against it, but the
log message in MigrationTask in case of a false return value would have to be changed, and
the isAlive status is really only relevant at task execution, as there's a 60 second delay
after submitting the task. So in theory you could submit a task for a node that was dead at
submission but is alive again at execution time.
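To illustrate the timing point, here is a minimal sketch in plain java.util.concurrent (the {{endpointAlive}} flag is a stand-in for {{FailureDetector.instance.isAlive(endpoint)}}, and the delay is shortened from the real 60 seconds): a node that is dead when the task is submitted can be alive again by the time the delayed task runs, which is why the check only matters at execution time.

```java
import java.util.concurrent.*;

public class DelayedLivenessCheck {
    // Stand-in for FailureDetector.instance.isAlive(endpoint); flips to true
    // while the task is waiting, mimicking a node that is down at submission
    // but back up by execution time.
    static volatile boolean endpointAlive = false;

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

        boolean aliveAtSubmission = endpointAlive; // false here

        // Schedule the "migration task" with a delay (60s in MigrationManager,
        // shortened for the demo). Aliveness is re-checked when the task fires.
        ScheduledFuture<Boolean> task = executor.schedule(
                () -> endpointAlive, 200, TimeUnit.MILLISECONDS);

        endpointAlive = true; // node comes back up during the delay

        boolean aliveAtExecution = task.get();
        System.out.println("aliveAtSubmission=" + aliveAtSubmission
                + " aliveAtExecution=" + aliveAtExecution);
        executor.shutdown();
    }
}
```

A check done only inside shouldPullSchemaFrom at submission time would have skipped this node, even though the pull would have succeeded when the task actually ran.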

bq. Given that the shadow round is meant to just get ring state without changing anything,
should we add an explicit check to MigrationManager#scheduleSchemaPull() to ensure that Gossiper.instance.isInShadowRound()
is false before scheduling?

The MigrationManager should never issue a schema pull during the shadow round. If we add such
a check, I'd prefer to throw an exception rather than fail silently and let the process run
in an undefined state. On the other hand, in terms of separation of concerns, it's not really
the business of the MM to monitor the gossiper life-cycle.
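For illustration, a minimal sketch of such a guard (the {{inShadowRound}} flag stands in for {{Gossiper.instance.isInShadowRound()}}, and the exception type and message are my suggestion, not anything in the attached patches):

```java
public class ShadowRoundGuard {
    // Stand-in for Gossiper.instance.isInShadowRound()
    static boolean inShadowRound = true;

    // Sketch of the proposed check at the top of scheduleSchemaPull():
    // fail loudly instead of silently dropping the pull, so a schema pull
    // attempted during the shadow round surfaces immediately as a bug.
    static void scheduleSchemaPull(String endpoint) {
        if (inShadowRound)
            throw new IllegalStateException(
                    "Schema pull scheduled during shadow round for " + endpoint);
        // ... normal submission of the MigrationTask would follow here
    }

    public static void main(String[] args) {
        try {
            scheduleSchemaPull("/10.0.0.1");
            System.out.println("no exception");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```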

> In-flight shadow round requests
> -------------------------------
>                 Key: CASSANDRA-12653
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Stefan Podkowinski
>            Assignee: Stefan Podkowinski
>            Priority: Minor
>             Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>         Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
> Bootstrapping or replacing a node in the cluster requires gathering and checking some host
IDs or tokens by doing a gossip "shadow round" once before joining the cluster. This is done
by sending a gossip SYN to all seeds until we receive a response with the cluster state, from
which point we can move on in the bootstrap process. Receiving a response marks the shadow round
as done and calls {{Gossiper.resetEndpointStateMap}} to clean up the received state again.
> The issue here is that at this point there might be other in-flight requests, and it's
very likely that shadow round responses from other seeds will be received afterwards, while
the current state of the bootstrap process doesn't expect this to happen (e.g. the gossiper may
or may not be enabled).
> One side effect is that MigrationTasks are spawned for each shadow round reply except
the first. Tasks might or might not execute depending on whether {{Gossiper.resetEndpointStateMap}}
had been called by execution time, which affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}}
at the start of the task. You'll see log messages such as the following when this happened:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 - InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]    2016-09-08 08:36:39,255 - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, but it would
be good to get a second opinion (feel free to close as "won't fix").
> /cc [~Stefania] [~thobbs]

This message was sent by Atlassian JIRA
