cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-13327) Pending endpoints size check for CAS doesn't play nicely with writes-on-replacement
Date Wed, 29 Mar 2017 17:28:41 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947564#comment-15947564
] 

Paulo Motta commented on CASSANDRA-13327:
-----------------------------------------

I'm a bit confused here: did the CAS unavailables happen while 127.0.0.5 was being replaced,
or did 127.0.0.5 remain in JOINING/replacing state after the replace had finished?

If the former, then this is expected behavior, I guess? You had 3 normal endpoints
(of which 1 was down) and 2 pending endpoints (the bootstrapping node and the replacing node) for
the requested key, so the CAS should not be allowed due to CASSANDRA-8346.
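
To make the arithmetic concrete, here is a minimal sketch of the check as I understand it. The function and exception names are hypothetical, and both rules (at most one pending endpoint per key, and a majority of natural plus pending endpoints) are my assumptions based on this thread, not code copied from Cassandra:

```python
class Unavailable(Exception):
    """Hypothetical stand-in for the coordinator's unavailable error."""


def required_paxos_participants(natural, pending, live):
    """Return the number of participants a CAS needs, or raise Unavailable.

    Both rules below are assumptions taken from this discussion.
    """
    # Assumed CASSANDRA-8346 rule: at most one pending endpoint per key.
    if len(pending) > 1:
        raise Unavailable(f"{len(pending)} pending endpoints, at most 1 allowed")

    # Assumed quorum rule: a majority of natural plus pending endpoints.
    participants = list(natural) + list(pending)
    required = len(participants) // 2 + 1

    live_count = sum(1 for endpoint in participants if endpoint in live)
    if live_count < required:
        raise Unavailable(f"{live_count} live participants, {required} required")
    return required


# Endpoints for the requested key in the ring from the description.
natural = ["127.0.0.2", "127.0.0.3", "127.0.0.4"]
live = {"127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.5"}

# Bootstrapping 127.0.0.1 plus replacing 127.0.0.5: two pending endpoints.
try:
    required_paxos_participants(natural, ["127.0.0.1", "127.0.0.5"], live)
    rejected = False
except Unavailable:
    rejected = True
```

Under these assumptions the CAS is rejected while both nodes are pending. With only the replacing node pending, the required count goes from 2 to 3, which would match the observation in the description that required participants is bumped by 1 during a host replacement.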

If the latter, then this is a bug in replace and must be fixed, since the node must move to
NORMAL state once the replacement is complete, at which point the CAS should succeed. Can you
reproduce this easily, or do you have logs that show why the replacement node did not go into
NORMAL state after the replacement finished?

> Pending endpoints size check for CAS doesn't play nicely with writes-on-replacement
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13327
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>
> Consider this ring:
> 127.0.0.1  MR  UP    JOINING  -7301836195843364181
> 127.0.0.2  MR  UP    NORMAL   -7263405479023135948
> 127.0.0.3  MR  UP    NORMAL   -7205759403792793599
> 127.0.0.4  MR  DOWN  NORMAL   -7148113328562451251
> where 127.0.0.1 was bootstrapping for cluster expansion. Note that, due to the failure
> of 127.0.0.4, 127.0.0.1 was stuck trying to stream from it and making no progress.
> Then the down node was replaced so we had:
> 127.0.0.1  MR  UP    JOINING  -7301836195843364181
> 127.0.0.2  MR  UP    NORMAL   -7263405479023135948
> 127.0.0.3  MR  UP    NORMAL   -7205759403792793599
> 127.0.0.5  MR  UP    JOINING  -7148113328562451251
> It’s confusing in the ring - the first JOINING is a genuine bootstrap, the second is
> a replacement. We now had CAS unavailables (but no non-CAS unavailables). I think it’s because
> the pending endpoints check thinks that 127.0.0.5 is gaining a range when it’s just replacing.
> The workaround is to kill the stuck JOINING node, but Cassandra shouldn’t unnecessarily
> fail these requests.
> It also appears that required participants is bumped by 1 during a host replacement, so
> if the replacing host fails you will get unavailables and timeouts.
> This is related to the check added in CASSANDRA-8346



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
