cassandra-user mailing list archives

From Alexander Dejanovski <a...@thelastpickle.com>
Subject Re: Reaper repair seems to "hang"
Date Wed, 04 Jan 2017 15:56:19 GMT
Actually, the problem is related to CASSANDRA-11430
<https://issues.apache.org/jira/browse/CASSANDRA-11430>.

Before 2.2.6, the notification service did not work with newly deprecated
repair methods, on which Reaper still currently relies.
C* 2.2.6 and onwards are not affected by this problem and work fine with
Reaper.

We're working on switching to the new repair method for 2.2 and 3.0/3.x,
which should be ready in a few days/weeks.

When using incremental repair, watch out for CASSANDRA-11696 which was
fixed in C* 2.1.15, 2.2.7, 3.0.8 and 3.8. In prior versions, unrepaired
SSTables can be marked as repaired, and thus never be repaired.
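
The version boundaries above can be turned into a quick check. This is only a sketch: the function names are made up for illustration, the fix versions are the ones listed in this message, and it assumes that the 3.8 fix covers the whole 3.1+ release line and that the CASSANDRA-11430 problem only concerns the 2.2.x line.

```python
# Fix versions for CASSANDRA-11696, as listed in this message.
FIXED_IN_BRANCH = {(2, 1): (2, 1, 15), (2, 2): (2, 2, 7), (3, 0): (3, 0, 8)}
FIXED_IN_3X = (3, 8, 0)  # assumption: 3.8 is the fix for every 3.1+ release


def parse_version(text):
    """Turn '2.2.5' or '3.7-SNAPSHOT' into a 3-tuple of ints."""
    nums = [int(p) for p in text.split("-")[0].split(".")[:3]]
    while len(nums) < 3:
        nums.append(0)
    return tuple(nums)


def affected_by_11696(version_text):
    """True/False for the release lines named above, None when unknown."""
    v = parse_version(version_text)
    branch = v[:2]
    if branch in FIXED_IN_BRANCH:
        return v < FIXED_IN_BRANCH[branch]
    if v[0] == 3:
        return v < FIXED_IN_3X
    return None  # release line not covered by this message


def reaper_notifications_affected(version_text):
    """CASSANDRA-11430: 2.2.x releases before 2.2.6 (assumed 2.2-only)."""
    v = parse_version(version_text)
    return v[:2] == (2, 2) and v < (2, 2, 6)


print(reaper_notifications_affected("2.2.5"))  # True
print(affected_by_11696("2.2.5"))              # True
```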

Cheers,



On Wed, Jan 4, 2017 at 6:09 AM Bhuvan Rawal <bhu1rawal@gmail.com> wrote:

> Hi Daniel,
>
> Looks like yours is a different case. If you're running incremental repair
> for the first time it may take a long time, esp. if the table is large, and
> repair may seem to be stuck even when things are working.
>
> You can try nodetool compactionstats when repair appears stuck; you'll
> find a validation compaction happening if that's indeed the case.
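
A rough way to script that check, assuming `nodetool` is on the PATH. The helper names are hypothetical, and the parsing only assumes that a running validation shows up as a row containing the word "Validation"; the exact column layout differs between versions, so treat the result as a hint (a keyspace or table literally named "validation" would also match).

```python
import subprocess


def has_validation_compaction(compactionstats_output):
    """Return True if any row of the output mentions a Validation task."""
    for line in compactionstats_output.splitlines():
        fields = [field.lower() for field in line.split()]
        if "validation" in fields:
            return True
    return False


def fetch_compactionstats(nodetool="nodetool"):
    """Run `nodetool compactionstats` locally and return its stdout."""
    result = subprocess.run(
        [nodetool, "compactionstats"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a node, `has_validation_compaction(fetch_compactionstats())` returning True while the repair looks stuck suggests it is still validating rather than hung.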
>
> For the first incremental repair you can follow this doc; subsequent
> incremental repairs should encounter very few SSTables:
>
> https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>
> Regards,
> Bhuvan
>
>
>
> On Jan 4, 2017 3:52 AM, "Daniel Kleviansky" <daniel@kleviansky.com> wrote:
>
> Hi Bhuvan,
>
> Thank you so very much for your detailed reply.
> Just to ensure everyone is across the same information, and responses are
> not duplicated across two different forums, I thought I'd share with the
> mailing list that I've created a GitHub issue at:
> https://github.com/thelastpickle/cassandra-reaper/issues/39
>
> Kind regards,
> Daniel
>
> On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal <bhu1rawal@gmail.com> wrote:
>
> Hi Daniel,
>
> We faced a similar issue during repair with Reaper. We ran repair with
> more repair threads than the number of Cassandra nodes, but repair
> intermittently got stuck and we had to do a rolling restart of the cluster
> or wait for the lock to time out (~1hr).
>
> We had a look at the stuck repair: thread pools were getting stuck at the
> AntiEntropy stage. From the synchronized block in the repair code it
> appeared that at most 1 concurrent repair session per node is possible.
>
> According to
> https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk
>  :
>
> Segment runner has protection mechanism to avoid overloading nodes using
> two simple rules to postpone repair if:
>
> 1. Number of pending compactions is greater than *MAX_PENDING_COMPACTIONS* (20
> by default)
> *2. Node is already running repair job*
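
The two rules quoted above boil down to a simple predicate. The sketch below only illustrates the described logic and is not Reaper's actual code (Reaper is written in Java); the function and parameter names are hypothetical, and only the default of 20 comes from the article.

```python
MAX_PENDING_COMPACTIONS = 20  # default quoted in the article


def should_postpone_segment(pending_compactions, repair_already_running,
                            max_pending=MAX_PENDING_COMPACTIONS):
    """Postpone if the node is busy compacting or already repairing."""
    too_many_compactions = pending_compactions > max_pending
    return too_many_compactions or repair_already_running


# A node with 5 pending compactions and no running repair is eligible.
print(should_postpone_segment(5, False))   # False
# A node that is already repairing is skipped, whatever its compaction backlog.
print(should_postpone_segment(5, True))    # True
```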
>
> We tried running Reaper with fewer threads than the number of nodes
> (assuming Reaper would not submit multiple segments to a single Cassandra
> node), but we still observed multiple repair segments going to the same
> node concurrently, so nodes could still get stuck in that state. Finally
> we settled on a single repair thread in the Reaper settings. It takes
> slightly more time but has completed successfully numerous times.
>
> Thread Dump of cassandra server when repair was getting stuck:
>
> "AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x00007f0fa16226a0 nid=0x3c82 waiting for monitor entry [0x00007ee9eabaf000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:392)
>         - waiting to lock <0x000000067c083308> (a org.apache.cassandra.service.ActiveRepairService)
>         at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:417)
>         at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:145)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
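
To spot this state without reading a whole dump by eye, something like the following can scan a thread dump (for example one saved from `jstack`) for blocked AntiEntropyStage threads. It is a minimal sketch: the function name is made up, and it assumes each thread section starts with a quoted thread name followed by a `java.lang.Thread.State:` line, as in the dump above.

```python
import re

THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
THREAD_STATE = re.compile(r"java\.lang\.Thread\.State:\s+(?P<state>[A-Z_]+)")


def blocked_threads(dump_text, name_prefix="AntiEntropyStage"):
    """Return names of threads with the given prefix that are BLOCKED."""
    blocked = []
    current_name = None
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        header = THREAD_HEADER.match(line)
        if header:
            # A new thread section begins; remember its name.
            current_name = header.group("name")
            continue
        state = THREAD_STATE.search(line)
        if (state and current_name
                and current_name.startswith(name_prefix)
                and state.group("state") == "BLOCKED"):
            blocked.append(current_name)
    return blocked
```

A non-empty result for the AntiEntropyStage prefix matches the symptom described here: the stage is waiting on the ActiveRepairService monitor.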
>
> Hope it helps!
>
> Regards,
> Bhuvan
>
>
> On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <
> alex@thelastpickle.com> wrote:
>
> Hi Daniel,
>
> Could you file a bug in the issue tracker?
> https://github.com/thelastpickle/cassandra-reaper/issues
>
> We'll figure out what's wrong and get your repairs running.
>
> Thanks !
>
> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky <daniel@kleviansky.com>
> wrote:
>
> Hi everyone,
>
> Using The Last Pickle's fork of Reaper, and unfortunately running into a
> bit of an issue. I'll try to break it down below.
>
> # Problem Description:
> * After starting repair via the GUI, progress remains at 0/x.
> * Cassandra nodes calculate their respective token ranges, and then
> nothing happens.
> * There were no errors in the Reaper or Cassandra logs. Only a message of
> acknowledgement that a repair had initiated.
> * Taking a stack trace of the running JVM, one can see that the thread
> spawning the repair process was waiting on a lock that was never
> released.
> * This occurred on all nodes, and prevented any manually initiated repair
> process from running. A rolling restart of each node was required, after
> which one could run a `nodetool repair` successfully.
>
> # Cassandra Cluster Details:
> * Cassandra 2.2.5 running on Windows Server 2008 R2
> * 6 node cluster, split across 2 DCs, with RF = 3:3.
>
> # Reaper Details:
> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL
> database.
>
> ## Reaper settings:
> * Parallelism: DC-Aware
> * Repair Intensity: 0.9
> * Incremental: true
>
> Don't want to swamp you with more details or unnecessary logs, especially
> as I'd have to sanitize them before sending them out, so please let me know
> if there is anything else I can provide, and I'll do my best to get it to
> you.
>
> Kind regards,
> Daniel
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
>
>
> --
> Daniel Kleviansky
> System Engineer & CX Consultant
> M: +61 (0) 499 103 043 | E: daniel@kleviansky.com
> | W: http://danielkleviansky.com
>
>
> --
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
