From: Bhuvan Rawal
Date: Wed, 4 Jan 2017 10:39:04 +0530
Subject: Re: Reaper repair seems to "hang"
To: user@cassandra.apache.org

Hi Daniel,

Looks like yours is a different case. If you're running incremental repair
for the first time, it may take a long time, especially if the table is
large. And the repair may seem to be stuck even when things are working.

You can try nodetool compactionstats when the repair appears stuck; you'll
find a validation compaction happening if that's indeed the case.
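For example, output along these lines (illustrative only; the keyspace and
table names are placeholders, and the exact columns vary by Cassandra
version) would show the node busy computing Merkle trees rather than hung:

    $ nodetool compactionstats
    pending tasks: 1
       compaction type   keyspace   table      completed   total       unit    progress
       Validation        my_ks      my_table   51610932    270953627   bytes   19.05%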
For the first incremental repair you can follow this doc; in subsequent
runs, incremental repair should encounter very few sstables:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
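In outline, the migration procedure in that doc marks all existing sstables
as repaired once, node by node (a sketch from memory -- treat the linked doc
as authoritative; keyspace, table and path below are placeholders):

    # 1. Keep compaction from churning sstables during the last full repair
    $ nodetool disableautocompaction my_ks my_table

    # 2. Run one final full (non-incremental) repair
    $ nodetool repair my_ks my_table

    # 3. Stop the node, then mark its sstables as repaired
    $ sstablerepairedset --really-set --is-repaired /path/to/my_ks/my_table/*-Data.db

    # 4. Restart the node and re-enable autocompaction

After that, each incremental repair only has to validate sstables written
since the previous one, which is why the first run is the expensive one.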
Regards,
Bhuvan

On Jan 4, 2017 3:52 AM, "Daniel Kleviansky" <daniel@kleviansky.com> wrote:

Hi Bhuvan,

Thank you so very much for your detailed reply.
Just to ensure everyone is across the same information, and responses are
not duplicated across two different forums, I thought I'd share with the
mailing list that I've created a GitHub issue at:
https://github.com/thelastpickle/cassandra-reaper/issues/39

Kind regards,
Daniel

On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal <bhu1rawal@gmail.com> wrote:

> Hi Daniel,
>
> We faced a similar issue during repair with Reaper. We ran repair with
> more repair threads than the number of Cassandra nodes, but on and off
> the repair would get stuck and we had to do a rolling restart of the
> cluster or wait for the lock time to expire (~1 hr).
>
> We had a look at the stuck repair: threadpools were getting stuck at the
> AntiEntropy stage. From the synchronized block in the repair code it
> appeared that at most one concurrent repair session per node is possible.
>
> According to
> https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :
>
> The segment runner has a protection mechanism to avoid overloading nodes,
> using two simple rules to postpone a repair if:
>
> 1. The number of pending compactions is greater than
>    MAX_PENDING_COMPACTIONS (20 by default)
> 2. The node is already running a repair job
>
> We tried running Reaper with fewer threads than the number of nodes
> (assuming Reaper would not submit multiple segments to a single Cassandra
> node), but we still observed multiple repair segments going to the same
> node concurrently, and therefore nodes could still get stuck in that
> state. Finally we settled on a single repair thread in the Reaper
> settings. Although it takes slightly more time, it has completed
> successfully numerous times.
>
> Thread dump of the Cassandra server when the repair was getting stuck:
>
> "AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x00007f0fa16226a0
> nid=0x3c82 waiting for monitor entry [0x00007ee9eabaf000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:392)
>         - waiting to lock <0x000000067c083308> (a org.apache.cassandra.service.ActiveRepairService)
>         at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:417)
>         at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:145)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> Hope it helps!
>
> Regards,
> Bhuvan
>
> On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <
> alex@thelastpickle.com> wrote:
>
>> Hi Daniel,
>>
>> Could you file a bug in the issue tracker?
>> https://github.com/thelastpickle/cassandra-reaper/issues
>>
>> We'll figure out what's wrong and get your repairs running.
>>
>> Thanks!
>>
>> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky <daniel@kleviansky.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Using The Last Pickle's fork of Reaper, and unfortunately running into
>>> a bit of an issue. I'll try to break it down below.
>>>
>>> # Problem Description:
>>> * After starting a repair via the GUI, progress remains at 0/x.
>>> * Cassandra nodes calculate their respective token ranges, and then
>>> nothing happens.
>>> * There were no errors in the Reaper or Cassandra logs, only a message
>>> of acknowledgement that a repair had initiated.
>>> * Performing a stack trace on the running JVM, one can see that the
>>> thread spawning the repair process was waiting on a lock that was never
>>> being released.
>>> * This occurred on all nodes, and prevented any manually initiated
>>> repair process from running. A rolling restart of each node was
>>> required, after which one could run a `nodetool repair` successfully.
>>>
>>> # Cassandra Cluster Details:
>>> * Cassandra 2.2.5 running on Windows Server 2008 R2
>>> * 6-node cluster, split across 2 DCs, with RF = 3:3.
>>>
>>> # Reaper Details:
>>> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a
>>> PostgreSQL database.
>>>
>>> ## Reaper settings:
>>> * Parallelism: DC-Aware
>>> * Repair Intensity: 0.9
>>> * Incremental: true
>>>
>>> Don't want to swamp you with more details or unnecessary logs,
>>> especially as I'd have to sanitize them before sending them out, so
>>> please let me know if there is anything else I can provide, and I'll
>>> do my best to get it to you.
>>>
>>> Kind regards,
>>> Daniel
>>>
>> --
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>

--
Daniel Kleviansky
System Engineer & CX Consultant
M: +61 (0) 499 103 043 | E: daniel@kleviansky.com | W: http://danielkleviansky.com