cassandra-commits mailing list archives

From "Alan Boudreault (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-8316) "Did not get positive replies from all endpoints" error on incremental repair
Date Tue, 25 Nov 2014 17:44:14 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224896#comment-14224896 ]

Alan Boudreault edited comment on CASSANDRA-8316 at 11/25/14 5:43 PM:
----------------------------------------------------------------------

[~krummas] [~llambiel] I have been able to reproduce this issue many times. However, I'm not
sure exactly where the problem is. I've attached a small bash script that I used to test:

* I run a cluster of 8 nodes.
* I set the cluster log level to TRACE (very important for reproducing the issue: we have to
slow things down).
* I run stress with n=500000 and RF=3.
* I start 3 "nodetool repair -par -inc" in parallel (important: starting only 2 repairs doesn't
produce the issue every time; I guess it depends on the system load).
* The issue is related to incremental repair. I can't reproduce it with sequential repair
and/or parallel, non-incremental repair.
* The compaction strategy is not related. I can reproduce the issue with LCS and STCS.

Basically, the issue happens when the system is very busy and doesn't respond fast enough.
In the sendRR function of MessagingService.java, a callback is added with a timeout of 10
seconds. The endpoint doesn't respond within 10 seconds, so we get the error. However, even if
we increase that timeout to 100 seconds, for example, the system doesn't get better and the
load is still very high. We just get the error message "Lost notification, check server log
for repair state of keyspace ..." instead of "Repair failed with error Did not get positive
replies from all endpoints.". When the load is high (even after the repair), I checked quickly
with YourKit, and what is taking a lot of CPU time is the AntiEntropyStage thread, so perhaps
the ActiveRepairService never ends?
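
To make the timeout behaviour above concrete, here is a minimal, self-contained sketch of
the general pattern: a reply is awaited with a deadline, and a busy endpoint that answers too
late is treated as a failure. This is NOT Cassandra's sendRR code. The class
ReplyTimeoutSketch, the method awaitReply and the millisecond values are hypothetical names
and numbers chosen only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not taken from MessagingService.java.
public class ReplyTimeoutSketch {

    /**
     * Simulates an endpoint that replies after replyDelayMs and waits for that
     * reply for at most timeoutMs. Returns true if the reply arrived in time.
     */
    static boolean awaitReply(long replyDelayMs, long timeoutMs) {
        ScheduledExecutorService endpoint = Executors.newSingleThreadScheduledExecutor();
        CompletableFuture<String> reply = new CompletableFuture<>();
        // The simulated endpoint completes the future once it gets around to replying.
        endpoint.schedule(() -> { reply.complete("ok"); }, replyDelayMs, TimeUnit.MILLISECONDS);
        try {
            reply.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // No reply before the deadline: the caller reports a failure.
            return false;
        } catch (ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Drop the pending simulated reply so the executor thread can exit.
            endpoint.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("fast endpoint replied in time: " + awaitReply(10, 1000));
        System.out.println("busy endpoint replied in time: " + awaitReply(2000, 100));
    }
}
```

In this sketch, raising the timeout only changes which branch is taken. It does nothing
about the reason the endpoint is slow, which matches what I see when going from 10 to 100
seconds.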

Let me know if you want me to go deeper in the profiling; perhaps I could get better profiling
data by enabling a Cassandra agent + YourKit.




>  "Did not get positive replies from all endpoints" error on incremental repair
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8316
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: cassandra 2.1.2
>            Reporter: Loic Lambiel
>            Assignee: Alan Boudreault
>         Attachments: test.sh
>
>
> Hi,
> I've got an issue with incremental repairs on our 15-node production cluster running 2.1.2 (new cluster, not yet loaded, RF=3).
> After successfully performing an incremental repair (-par -inc) on 3 nodes, I started receiving "Repair failed with error Did not get positive replies from all endpoints." from nodetool on all remaining nodes:
> [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges for keyspace xxxx (seq=false, full=false)
> [2014-11-14 09:12:47,919] Repair failed with error Did not get positive replies from all endpoints.
> All the nodes are up and running, and the local system log shows that the repair commands got started and that's it.
> I've also noticed that soon after the repair, several nodes started having more CPU load indefinitely without any particular reason (no tasks / queries, nothing in the logs). I then restarted C* on these nodes and retried the repair on several nodes, which was successful until I faced the issue again.
> I tried to repro on our 3-node preproduction cluster without success.
> It looks like I'm not the only one having this issue: http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
> Any idea?
> Thanks
> Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
