I relaunched my cluster from scratch (for an unrelated reason). After the relaunch I could run nodetool repair -par -inc -pr on the nodes without issue, but pretty much the moment I started pushing production load to the cluster I ran into the same problem again. I first opened a ticket for adding logging info, but I'll most probably end up adding the logging myself and then start digging into the actual root cause.

I also ran one nodetool repair -par (i.e. without incremental repair) and it seems that the repair started. I guess I need to go through the sources to see whether there's a different code path which would explain this.

I can't yet call this conclusive, but it seems that I can't run incremental repairs on the current 2.1.1 and I'm still wondering if anybody else is experiencing the same problem.

On Thu, Oct 30, 2014 at 1:14 PM, Juho Mäkinen <juho.makinen@gmail.com> wrote:
No, the cluster seems to be performing just fine. It seems that the callback in prepareForRepair() could easily be modified to print which node(s) failed to respond, so that the debugging effort could be better focused. This of course doesn't help in this case, as it's not trivial to add the log lines and roll them out to the entire cluster.
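
To illustrate the idea of reporting which nodes failed to respond, here is a minimal standalone sketch. It is not Cassandra's actual code: the class PrepareTracker and its methods onResponse and awaitReplies are hypothetical names, and endpoints are represented as plain strings. It only shows the general pattern of tracking pending endpoints next to a latch so that the error message can name the silent nodes.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Cassandra: remembers which endpoints
// have not yet acknowledged a "prepare" message.
final class PrepareTracker {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final CountDownLatch latch;

    PrepareTracker(Set<String> endpoints) {
        pending.addAll(endpoints);
        latch = new CountDownLatch(endpoints.size());
    }

    // Called from the response callback when an endpoint replies positively.
    void onResponse(String endpoint) {
        // Count down only once per endpoint, even if it replies twice.
        if (pending.remove(endpoint)) {
            latch.countDown();
        }
    }

    // Waits up to the timeout and returns the endpoints that never replied
    // (an empty set means every endpoint answered in time).
    Set<String> awaitReplies(long timeout, TimeUnit unit) throws InterruptedException {
        latch.await(timeout, unit);
        return new TreeSet<>(pending);
    }
}
```

With something like this, the caller could include the returned set in the failure message instead of only saying that not all endpoints replied.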

The cluster is relatively young, containing only 450GB with RF=3 spread over nine nodes and I'm still practicing how to run incremental repairs on the cluster when I stumbled on this issue.

On Thu, Oct 30, 2014 at 12:52 PM, Rahul Neelakantan <rahul@rahul.be> wrote:
It appears to come from the ActiveRepairService.prepareForRepair portion of the code.

Are you sure all nodes are reachable from the node you are initiating repair on, at the same time?

Any Node up/down/died messages?

Rahul Neelakantan

> On Oct 30, 2014, at 6:37 AM, Juho Mäkinen <juho.makinen@gmail.com> wrote:
>
> I'm having problems running nodetool repair -inc -par -pr on my 2.1.1 cluster due to a "Did not get positive replies from all endpoints" error.
>
> Here's an example output:
> root@db08-3:~# nodetool repair -par -inc -pr
> [2014-10-30 10:33:02,396] Nothing to repair for keyspace 'system'
> [2014-10-30 10:33:02,420] Starting repair command #10, repairing 256 ranges for keyspace profiles (seq=false, full=false)
> [2014-10-30 10:33:17,240] Repair failed with error Did not get positive replies from all endpoints.
> [2014-10-30 10:33:17,263] Starting repair command #11, repairing 256 ranges for keyspace OpsCenter (seq=false, full=false)
> [2014-10-30 10:33:32,242] Repair failed with error Did not get positive replies from all endpoints.
> [2014-10-30 10:33:32,249] Starting repair command #12, repairing 256 ranges for keyspace system_traces (seq=false, full=false)
> [2014-10-30 10:33:44,243] Repair failed with error Did not get positive replies from all endpoints.
>
> The local system log shows that the repair commands got started, but it seems that they immediately get cancelled due to that error, which btw can't be seen in the cassandra log.
>
> I tried monitoring the logs on all machines in case another machine showed some useful error, but so far I haven't found anything.
>
> Any ideas where this error comes from?
>
>  - Garo
>