cassandra-commits mailing list archives

From "Nick Bailey (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-1216) removetoken drops node from ring before re-replicating its data is finished
Date Fri, 27 Aug 2010 19:06:57 GMT


Nick Bailey commented on CASSANDRA-1216:

After some more thought, I think there are two problems here.

 * The timeout for waiting on a stream to complete - An arbitrary timeout here is not the
right way to do this. What we really need is the concept of stream progress. We should be
able to verify whether a stream is progressing and, based on that, retry it.  CASSANDRA-1438
is related to this problem and could be modified to implement this.

 * The timeout waiting for nodes to confirm replication - Ideally there would be no timeout
here. The problem, though, is that if a node that should be grabbing data goes down permanently,
removeToken will wait forever.  I think it's reasonable to have some sort of timeout in this
case. A log message/error can indicate which machines were being waited on for replication.
An administrator should know whether that machine went down or is still streaming. That will
determine whether repair needs to be run.  The alternative, I guess, would be to periodically
wake up and check that the nodes we are waiting on are still alive.  That wouldn't be
particularly hard to implement.
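
To make the second point concrete, here is a rough sketch of a wait that combines both ideas: it wakes up periodically to check liveness and still gives up at a deadline, reporting which nodes it was waiting on. Everything in it is hypothetical (the ReplicationWaiter class, the isAlive predicate standing in for a failure detector, node names as plain strings); it is not taken from the attached patches.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names are existing Cassandra classes.
final class ReplicationWaiter {
    // Nodes that never confirmed, split by why we stopped waiting on them.
    static final class Result {
        final Set<String> deadNodes;      // reported down while we were waiting
        final Set<String> timedOutNodes;  // still alive (maybe still streaming) at the deadline

        Result(Set<String> deadNodes, Set<String> timedOutNodes) {
            this.deadNodes = deadNodes;
            this.timedOutNodes = timedOutNodes;
        }

        boolean complete() {
            return deadNodes.isEmpty() && timedOutNodes.isEmpty();
        }
    }

    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    ReplicationWaiter(Set<String> nodes) {
        pending.addAll(nodes);
    }

    // Called (possibly from another thread) when a node confirms replication.
    void confirm(String node) {
        pending.remove(node);
    }

    // isAlive stands in for whatever liveness check is available (assumption).
    Result await(Predicate<String> isAlive, long pollMillis, long timeoutMillis) {
        Set<String> dead = new HashSet<>();
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (true) {
            // Stop waiting on nodes that look dead; they can never confirm.
            for (String node : pending) {
                if (!isAlive.test(node) && pending.remove(node)) {
                    dead.add(node);
                }
            }
            if (pending.isEmpty() || System.nanoTime() - start >= timeoutNanos) {
                break;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep interrupt status, stop waiting
                break;
            }
        }
        return new Result(dead, new HashSet<>(pending));
    }
}
```

The two sets in the result map onto the distinction an administrator cares about: a node in deadNodes went down, so repair is probably needed, while a node in timedOutNodes may simply still be streaming.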

I don't think returning immediately from the call is the right approach.  That is part of
the reason why this ticket was created. In the case that replication fails somewhere, there
is no feedback to the user.  At least timing out eventually provides information about which
machines we think failed to replicate data.

As for multiple remove calls and the coordinator going down: I think there should be a
'force' option for the case where the coordinator goes down and you believe the rest of the
nodes completed the operation.  To prevent multiple concurrent calls to removeToken there
should just be a check to make sure the coordinator is dead before another call can be performed.
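
One possible reading of that check, sketched below: a new call is rejected while the previous coordinator is alive; once it is dead, a plain call takes over as coordinator and a 'force' call completes the removal without waiting. RemovalGuard, Decision and the isAlive predicate are made-up names for illustration, and treating 'force' as "skip the wait" is an assumption about how the option would behave.

```java
import java.util.function.Predicate;

// Hypothetical sketch of the coordinator check; not an existing Cassandra class.
final class RemovalGuard {
    enum Decision { START, REJECTED, RESTART, FORCE_COMPLETE }

    private String coordinator; // node coordinating the in-flight removal, or null

    synchronized Decision request(String caller, boolean force, Predicate<String> isAlive) {
        if (coordinator == null) {
            coordinator = caller;          // nothing in flight: normal removal
            return Decision.START;
        }
        if (isAlive.test(coordinator)) {
            return Decision.REJECTED;      // a live coordinator is still handling it
        }
        if (force) {
            coordinator = null;            // assume the other nodes finished; complete now
            return Decision.FORCE_COMPLETE;
        }
        coordinator = caller;              // old coordinator is dead: take over and wait again
        return Decision.RESTART;
    }

    // Called by the coordinator when the removal completes normally.
    synchronized void finish() {
        coordinator = null;
    }
}
```

Note that 'force' is ignored while the coordinator is alive, so it cannot be used to cut short a removal that is still being coordinated.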

So besides those few changes above, I think we should either implement this partway, with
a timeout for stream replication, or postpone completion here until we add the concept of
stream progress.
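
For reference, "stream progress" could be as small as the sketch below: record when the byte count last advanced, and treat the stream as stalled (and a candidate for retry) only when it has not advanced for some interval, rather than when a fixed total time has elapsed. StreamProgress and its methods are hypothetical names; timestamps are passed in explicitly (e.g. from System.nanoTime()) to keep the logic easy to test.

```java
// Hypothetical sketch of stream progress tracking; not an existing Cassandra class.
final class StreamProgress {
    private long bytesTransferred;   // highest byte count reported so far
    private long lastProgressNanos;  // time at which bytesTransferred last increased

    StreamProgress(long nowNanos) {
        this.lastProgressNanos = nowNanos;
    }

    // Report the cumulative number of bytes transferred so far.
    synchronized void onBytes(long totalBytes, long nowNanos) {
        if (totalBytes > bytesTransferred) {
            bytesTransferred = totalBytes;
            lastProgressNanos = nowNanos;
        }
    }

    // True if no new bytes have arrived for at least stallNanos.
    synchronized boolean isStalled(long nowNanos, long stallNanos) {
        return nowNanos - lastProgressNanos >= stallNanos;
    }
}
```

A slow but healthy stream keeps resetting lastProgressNanos and is never retried, while a stream whose source was restarted stops advancing and is flagged after one stall interval.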

> removetoken drops node from ring before re-replicating its data is finished
> ---------------------------------------------------------------------------
>                 Key: CASSANDRA-1216
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7 beta 1
>            Reporter: Jonathan Ellis
>            Assignee: Nick Bailey
>             Fix For: 0.7 beta 2
>         Attachments: 0001-Add-callbacks-to-streaming.patch, 0002-Modify-removeToken-to-be-similar-to-decommission.patch,
> 0003-Fixes-to-old-tests.patch, 0004-Additional-tests-for-removeToken.patch
>
> this means that if something goes wrong during the re-replication (e.g. a source node
> is restarted) there is (a) no indication that anything has gone wrong and (b) no way to restart
> the process (other than the Big Hammer of running repair)

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
