cassandra-commits mailing list archives

From "Marcus Eriksson (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
Date Fri, 25 Jul 2014 11:55:39 GMT


Marcus Eriksson commented on CASSANDRA-7560:

In general, LGTM. Nit: it would be nice to have some javadoc on the failureCallback param
to sendRR(..) and on sendRRWithFailure(..).

Btw, I think "MessageIn.isFailureCallback()" is a bit confusing. Would it make sense to rename
it to something like "doCallbackOnFailure()"?
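
For illustration only, here is a minimal, self-contained sketch of the kind of failure-aware callback the comment refers to, with the javadoc it asks for. This is not the code from the patch: the class, the interfaces, and the deliver(..) and awaitOutcome(..) helpers are all hypothetical, and only the flag name doCallbackOnFailure is borrowed from the suggestion above. It shows why a caller that is never told about a remote failure can only wait, while one that opts in to failure callbacks finishes promptly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailureCallbackSketch {

    /** Hypothetical callback that only hears about successful replies. */
    interface ResponseCallback<T> {
        void onResponse(T reply);
    }

    /** Hypothetical extension that is also told when the remote side fails. */
    interface FailureAwareCallback<T> extends ResponseCallback<T> {
        void onFailure(String endpoint);
    }

    /**
     * Simulates delivering the outcome of one request to its callback.
     *
     * @param endpoint            name of the remote node (illustrative only)
     * @param remoteSucceeded     whether the remote node handled the request
     * @param callback            receiver of the outcome
     * @param doCallbackOnFailure if true and the callback is failure-aware,
     *                            onFailure(..) is invoked when the remote
     *                            side fails; if false, a failure is silently
     *                            dropped and the caller hears nothing
     */
    static void deliver(String endpoint, boolean remoteSucceeded,
                        ResponseCallback<String> callback,
                        boolean doCallbackOnFailure) {
        if (remoteSucceeded) {
            callback.onResponse("ok from " + endpoint);
        } else if (doCallbackOnFailure && callback instanceof FailureAwareCallback) {
            ((FailureAwareCallback<String>) callback).onFailure(endpoint);
        }
        // Otherwise the failure is dropped, so a waiting caller never wakes up.
    }

    /** Sends one failing request and reports what the waiting caller observes. */
    static String awaitOutcome(boolean doCallbackOnFailure) throws Exception {
        CompletableFuture<String> result = new CompletableFuture<>();
        FailureAwareCallback<String> callback = new FailureAwareCallback<String>() {
            @Override
            public void onResponse(String reply) {
                result.complete(reply);
            }

            @Override
            public void onFailure(String endpoint) {
                result.complete("failed on " + endpoint);
            }
        };
        deliver("node2", false, callback, doCallbackOnFailure);
        try {
            // A short timeout stands in for a wait that would otherwise never end.
            return result.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "timed out waiting";
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(awaitOutcome(true));
        System.out.println(awaitOutcome(false));
    }
}
```

With the flag set, the caller learns about the failure immediately. Without it, only the timeout that the sketch adds ends the wait.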

> 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
> ----------------------------------------------------------------------
>                 Key: CASSANDRA-7560
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Vladimir Avram
>            Assignee: Yuki Morishita
>             Fix For: 2.0.10
>         Attachments: 0001-backport-CASSANDRA-6747.patch, cassandra_daemon.log, cassandra_daemon_rep1.log, cassandra_daemon_rep2.log, nodetool_command.log
> Running {{nodetool repair -pr}} will sometimes hang on one of the resulting AntiEntropySessions.
> The system logs will show the repair command starting
> {noformat}
>  INFO [Thread-3079] 2014-07-15 02:22:56,514 (line 2569) Starting repair command #1, repairing 256 ranges for keyspace x
> {noformat}
> You can then see a few AntiEntropySessions completing with:
> {noformat}
> INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 (line 282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed successfully
> {noformat}
> Finally, at some point we reach an AntiEntropySession that hangs just before requesting
> the merkle trees for the next column family in line for repair. So we first see the previous
> CF being finished, and then the whole repair session hangs here with no visible progress or errors
> on this or any of the related nodes.
> {noformat}
> INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 (line 221) [repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully synced
> {noformat}
> Notes:
> * Single DC 6 node cluster with an average load of 86 GB per node.
> * This appears to be random; it does not always happen on the same CF or on the same

