cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ariel Weisberg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-13327) Pending endpoints size check for CAS doesn't play nicely with writes-on-replacement
Date Wed, 29 Mar 2017 17:54:41 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947607#comment-15947607
] 

Ariel Weisberg commented on CASSANDRA-13327:
--------------------------------------------

Give me a bit to package up my dtest demonstrating this.

> Pending endpoints size check for CAS doesn't play nicely with writes-on-replacement
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13327
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>
> Consider this ring:
> 127.0.0.1  MR UP     JOINING -7301836195843364181
> 127.0.0.2    MR UP     NORMAL -7263405479023135948
> 127.0.0.3    MR UP     NORMAL -7205759403792793599
> 127.0.0.4   MR DOWN     NORMAL -7148113328562451251
> where 127.0.0.1 was bootstrapping for cluster expansion. Note that, due to the failure
of 127.0.0.4, 127.0.0.1 was stuck trying to stream from it and making no progress.
> Then the down node was replaced so we had:
> 127.0.0.1  MR UP     JOINING -7301836195843364181
> 127.0.0.2    MR UP     NORMAL -7263405479023135948
> 127.0.0.3    MR UP     NORMAL -7205759403792793599
> 127.0.0.5   MR UP     JOINING -7148113328562451251
> It’s confusing in the ring - the first JOINING is a genuine bootstrap, the second is
a replacement. We now had CAS unavailables (but no non-CAS unvailables). I think it’s because
the pending endpoints check thinks that 127.0.0.5 is gaining a range when it’s just replacing.
> The workaround is to kill the stuck JOINING node, but Cassandra shouldn’t unnecessarily
fail these requests.
> It also appears like required participants is bumped by 1 during a host replacement so
if the replacing host fails you will get unavailables and timeouts.
> This is related to the check added in CASSANDRA-8346



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message