cassandra-commits mailing list archives

From "Geoffrey Yu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-9876) One way targeted repair
Date Wed, 10 Aug 2016 01:51:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414586#comment-15414586 ]

Geoffrey Yu edited comment on CASSANDRA-9876 at 8/10/16 1:50 AM:
-----------------------------------------------------------------

Thanks for the quick review! I’ve attached a new patch that addresses your comments, except
for one, on which I wanted to get some more feedback first.

I also attached a patch that adds one dtest for the pull repair. It works nearly identically
to the token range repair test, except that it asserts that one of the nodes only sends data
and the other only receives.

{quote}
I don't think it's necessary to make specifying --start-token and --end-token mandatory, since
if that is not specified it will just pull repair all common ranges between specified hosts.
{quote}

The reason I added the check for a token range is that the repair code, as it stands, doesn’t
actually restrict the repair to only the common ranges between the specified hosts. I wasn’t
sure whether this was the intended behavior or a bug.

To reproduce the issue, create a 3-node cluster, add a keyspace with replication factor 2,
and run a regular repair through nodetool on that keyspace with exactly two hosts specified.
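To make the replica layout in that setup concrete, here is a small Python sketch (purely illustrative; the node names and the {{replicas}} helper are made up, not Cassandra code) of a 3-node ring with SimpleStrategy and RF=2, where each range lives on its primary owner plus the next node clockwise. It shows that any given pair of nodes shares exactly one range:

```python
# Illustrative model of a 3-node ring with replication factor 2.
# Node names and the replicas() helper are hypothetical, not Cassandra code.

nodes = ["A", "B", "C"]  # ring order by token


def replicas(range_index, rf=2):
    # With SimpleStrategy, a range is stored on its primary owner
    # plus the next rf - 1 nodes clockwise around the ring.
    return {nodes[(range_index + i) % len(nodes)] for i in range(rf)}


ranges = {f"range_{i}": replicas(i) for i in range(len(nodes))}
# range_0 -> {A, B}, range_1 -> {B, C}, range_2 -> {C, A}

# Ranges replicated on both A and B:
common = [r for r, reps in ranges.items() if {"A", "B"} <= reps]
print(common)  # -> ['range_0']: exactly one range is common to A and B
```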

The reason it happens is that if no ranges are specified, the repair will [add all ranges
on the local node|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L3137].
Then when we hit {{RepairRunnable}}, we try to [find a list of neighbors for each range|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/repair/RepairRunnable.java#L160-L162].

The problem is that not every range the local node owns is necessarily also replicated on the
remote node specified through nodetool. In the example above, only one range is common between
any two nodes. Because of this, the [check here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/ActiveRepairService.java#L246-L251]
may throw an exception, which aborts the repair.
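As a hypothetical sketch of that failure mode (the names below are illustrative; this is not the actual {{ActiveRepairService}} logic): with no token range given, the repair iterates over every range the local node replicates and intersects each range’s replica set with the hosts passed to nodetool. Any range whose only other replica was not among the specified hosts has no eligible neighbour:

```python
# Illustrative model of the neighbor lookup described above; names are
# made up and this is not the real Cassandra code. Local node A replicates
# two ranges in a 3-node, RF=2 ring, but repair was invoked with hosts A and B.

local = "A"
specified_hosts = {"A", "B"}

local_ranges = {          # replica sets of the ranges replicated on A
    "range_0": {"A", "B"},
    "range_2": {"C", "A"},
}

# A range "fails" when, after restricting its replicas to the specified
# hosts and dropping the local node, no neighbour is left to repair with.
failing = [r for r, reps in local_ranges.items()
           if not ((reps & specified_hosts) - {local})]
print(failing)  # -> ['range_2']: its only other replica, C, was not specified,
                # which is what trips the "at least two endpoints that are
                # neighbours" check and aborts the repair
```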

If this is the intended behavior, then forcing the user to specify a token range that is common
to both nodes prevents that exception. Otherwise, the error message “Repair requires at least
two endpoints that are neighbours before it can continue” can be confusing to the operator,
since the two specified nodes may actually share a common range. What do you think?



> One way targeted repair
> -----------------------
>
>                 Key: CASSANDRA-9876
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9876
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Geoffrey Yu
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: 9876-dtest-master.txt, 9876-trunk-v2.txt, 9876-trunk.txt
>
>
> Many applications use C* by writing to one local DC. The other DC is used when the local
DC is unavailable. When the local DC becomes available, we want to run a targeted repair b/w
one endpoint from each DC to minimize the data transfer over WAN.  In this case, it will be
helpful to do a one way repair in which data will only be streamed from other DC to local
DC instead of streaming the data both ways. This will further minimize the traffic over WAN.
This feature should only be supported if a targeted repair is run involving 2 hosts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
