cassandra-commits mailing list archives

From "Marcus Eriksson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
Date Thu, 19 May 2016 13:05:13 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291052#comment-15291052 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

The problem occurs when the repair coordinator dies: the repairing nodes then never clear out
their ParentRepairSessions.

My approach is to have ActiveRepairService listen for endpoint changes and failure
detector events. For example:

* in a cluster with nodes A, B and C, we trigger a repair against A
* during the repair, A dies
* B and C are notified of this and mark the ParentRepairSession as failed

It gets a bit tricky since node A might not have realized it was down and could just continue
with its repair, so we keep a 'failed' version of the parent repair session around for 24h
on B and C. If anyone then tries to fetch that session (say node A keeps sending validation
requests), we throw an exception, which fails the repair on node A as well.
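A minimal, hypothetical sketch of that mechanism (simplified names, not the actual ActiveRepairService API): failed sessions are kept in a side map for 24h, and any late lookup of a failed session throws instead of silently proceeding:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified tracker illustrating the approach described above.
public class ParentSessionTracker
{
    private static final long FAILED_RETENTION_MILLIS = TimeUnit.HOURS.toMillis(24);

    private final Map<UUID, Object> activeSessions = new ConcurrentHashMap<>();
    // parent session id -> time the failure was recorded
    private final Map<UUID, Long> failedSessions = new ConcurrentHashMap<>();

    public void register(UUID parentSessionId, Object session)
    {
        activeSessions.put(parentSessionId, session);
    }

    // Called from the failure detector callback when the coordinator is marked down:
    // instead of just dropping the session, remember that it failed.
    public void onCoordinatorFailure(UUID parentSessionId)
    {
        activeSessions.remove(parentSessionId);
        failedSessions.put(parentSessionId, System.currentTimeMillis());
    }

    // A later lookup (e.g. a validation request from a coordinator that has not
    // noticed it was marked down) throws, failing the repair on that node as well.
    public Object get(UUID parentSessionId)
    {
        Long failedAt = failedSessions.get(parentSessionId);
        if (failedAt != null)
        {
            if (System.currentTimeMillis() - failedAt < FAILED_RETENTION_MILLIS)
                throw new RuntimeException("Parent repair session " + parentSessionId + " has failed");
            failedSessions.remove(parentSessionId); // retention window expired, forget it
        }
        Object session = activeSessions.get(parentSessionId);
        if (session == null)
            throw new RuntimeException("Unknown parent repair session " + parentSessionId);
        return session;
    }
}
```

The 24h retention is what closes the race: simply deleting the session on B and C would make a late validation request look like an unknown session, but keeping the failed marker lets them reject it with a clear error.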

A dtest to reproduce the error:
https://github.com/krummas/cassandra-dtest/commits/marcuse/11824

||branch||testall||dtest||
|[marcuse/11824|https://github.com/krummas/cassandra/tree/marcuse/11824]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-dtest]|
|[marcuse/11824-2.2|https://github.com/krummas/cassandra/tree/marcuse/11824-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-2.2-dtest]|
|[marcuse/11824-3.0|https://github.com/krummas/cassandra/tree/marcuse/11824-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.0-dtest]|
|[marcuse/11824-3.7|https://github.com/krummas/cassandra/tree/marcuse/11824-3.7]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.7-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.7-dtest]|
|[marcuse/11824-trunk|https://github.com/krummas/cassandra/tree/marcuse/11824-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-trunk-dtest]|

I should also note that this does not seem to fix CASSANDRA-11728.

Could you review, [~yukim]?

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 3.0.x
>
>
> I have a test that disables gossip and runs repair at the same time. 
> {quote}
> WARN  [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775 StorageService.java:384 - Stopping gossip by operator request
> INFO  [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775 Gossiper.java:1463 - Announcing shutdown
> INFO  [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,776 StorageService.java:1999 - Node /172.31.31.1 state jump to shutdown
> INFO  [HANDSHAKE-/172.31.17.32] 2016-05-17 16:57:21,895 OutboundTcpConnection.java:514 - Handshaking version with /172.31.17.32
> INFO  [HANDSHAKE-/172.31.24.76] 2016-05-17 16:57:21,895 OutboundTcpConnection.java:514 - Handshaking version with /172.31.24.76
> INFO  [Thread-25] 2016-05-17 16:57:21,925 RepairRunnable.java:125 - Starting repair command #1, repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO  [Thread-26] 2016-05-17 16:57:21,953 RepairRunnable.java:125 - Starting repair command #2, repairing keyspace stresscql with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO  [Thread-27] 2016-05-17 16:57:21,967 RepairRunnable.java:125 - Starting repair command #3, repairing keyspace system_traces with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 2)
> {quote}
> This ends up failing:
> {quote}
> 16:54:44.844 INFO  serverGroup-node-1-574 - STDOUT: [2016-05-17 16:57:21,933] Starting repair command #1, repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> [2016-05-17 16:57:21,943] Did not get positive replies from all endpoints. List of failed endpoint(s): [172.31.24.76, 172.31.17.32]
> [2016-05-17 16:57:21,945] null
> {quote}
> Subsequent calls to repair with all nodes up still fail:
> {quote}
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460 CompactionManager.java:1193 - Cannot start multiple repair sessions over the same sstables
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460 Validator.java:261 - Failed creating a merkle tree for [repair #66425f10-1c61-11e6-83b2-0b1fff7a067d on keyspace1/standard1, 
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
