Date: Wed, 20 Sep 2017 11:49:00 +0000 (UTC)
From: "Thomas Steinmaurer (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-13885) Allow to run full repairs in 3.0 without additional cost of anti-compaction

    [ https://issues.apache.org/jira/browse/CASSANDRA-13885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173049#comment-16173049 ]

Thomas Steinmaurer edited comment on CASSANDRA-13885 at 9/20/17 11:48 AM:
--------------------------------------------------------------------------

This is about easing the operational side: 2.2+ is a major shift towards behaving differently and being much more complex, when all I want is to run a full repair across my 9-node cluster on 2 small-volume CFs on a daily basis (grace period = 72hr). With 2.1 I was used to doing exactly that by running the following, kicked off in parallel on all nodes:
{code}
nodetool repair -pr mykeyspace mycf1 mycf2
{code}
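For illustration, a minimal sketch of what such a cron-driven, all-nodes-in-parallel setup might look like; the schedule, user, and log path are assumptions, not taken from the original report:
{code}
# /etc/cron.d/cassandra-repair (hypothetical) -- installed identically on all 9 nodes,
# so every node kicks off a primary-range repair of the two CFs in parallel each day.
# 2.1 syntax: repairs are full by default, -pr limits each node to its primary ranges.
0 2 * * * cassandra nodetool repair -pr mykeyspace mycf1 mycf2 >> /var/log/cassandra/repair.log 2>&1
{code}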
Ok, I learned that incremental repair is the default since 2.2+, so I need to additionally apply the -full option. Not a big deal, but when running the following with 3.0.14, again kicked off in parallel on all nodes:
{code}
nodetool repair -full -pr mykeyspace mycf1 mycf2
{code}
I start to see basically the following nodetool output:
{code}
...
[2017-09-20 11:34:49,968] Some repair failed
[2017-09-20 11:34:49,968] Repair command #8 finished in 0 seconds
error: Repair job has failed with the error message: [2017-09-20 11:34:49,968] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2017-09-20 11:34:49,968] Some repair failed
        at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:115)
        at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
{code}
With corresponding entries in the Cassandra log:
{noformat}
...
6084592,-2610211481793768452], (280506507907773715,302389115279520703], (-5974981857606828384,-5962141498717352776], (6642604399479339844,6664596384716805222], (3176178340546590823,3182242320217954219], (6534347373256357699,6534785652363368819], (-3756238465673315474,-3752190783358815211], (7139677986395944961,7145455101208653220], (-3297144043975661711,-3274612177648431803], (5273980670821159743,5281982202791896119], (-6128989336346960670,-6080468590993099589], (-2173810736498649004,-2131529908597487459], (7439773636855937356,7476905072738807852]]] Validation failed in /10.176.38.128
        at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:178) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:486) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:164) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_102]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_102]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.0.14.jar:3.0.14]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_102]
INFO  [InternalResponseStage:32] 2017-09-20 11:41:58,054 RepairRunnable.java:337 - Repair command #11 finished in 0 seconds
ERROR [ValidationExecutor:29] 2017-09-20 11:41:58,056 Validator.java:268 - Failed creating a merkle tree for [repair #b53b44a0-9df8-11e7-916c-a5c15f10854d on ruxitdb/Me2Data, [(-9036672081060178828,-9030154922268771156], (1469740174912727009,1543926123757478678], (8863036841963129257,8867114458641555677], (-2610211481793768452,-2603133469451342452], (-5434810958758711978,-5401236033897257975], (5446456273884963354,5512385756828046297], (-5733849916893192315,-5651354489457211297], (5579261856873396905,5629665914232130557], (-3661618321040339655,-3653143301436649195], (-3344525143879048394,-3314190367243835481], (2113416595214497156,2140252649319845130], (-186804760253388038,-136455684914788326], (130823363710141924,188931062065209030], (229372617650564758,256901816244047153], (-3460004924864535758,-3448189173914847013], (7667789006793829873,7672435884237063221], (-5401236033897257975,-5371782704264523053], (-3829469150597291433,-3823438964996675746], (8833078706147578756,8850650250670324319], (5112280378866264088,5193085768303122438], (4155723864378803139,4171414017862833361], (-840951991332283834,-820389464184628689], (-8599778977804844748,-8579712223690479957], (6900678321423523623,6900784348977090766], (-7453077334586977466,-7449408715037121306], (1703184128556034757,1708159674820812561], (772306949709931532,799988896726778408], (-5294307699953409870,-52800750682 ...
{noformat}
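(For anyone chasing a similar failure: the "Validation failed in /10.176.38.128" line names the replica whose merkle-tree validation broke, so the matching server-side error can be found on that node. A rough sketch; the system.log location is an assumption, as it varies by installation:)
{code}
# Run on the replica named in "Validation failed in /10.176.38.128";
# log path /var/log/cassandra/system.log is assumed, not from the report:
grep -B 2 -A 15 'Failed creating a merkle tree' /var/log/cassandra/system.log
{code}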
> Allow to run full repairs in 3.0 without additional cost of anti-compaction
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13885
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13885
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Thomas Steinmaurer
>
> This ticket is basically the result of the discussion in the Cassandra user list: https://www.mail-archive.com/user@cassandra.apache.org/msg53562.html
> Paulo Motta asked me to open a ticket to think about back-porting the ability to run full repairs without the additional cost of anti-compaction.
> Basically there is no way in 3.0 to run full repairs from several nodes concurrently without troubles caused by (overlapping?) anti-compactions. Coming from 2.1, this is a major change from an operational POV, basically breaking any (e.g. cron-job-based) solution kicking off -pr based repairs on several nodes concurrently.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org