From: "Sergey Lapukhov (JIRA)"
To: commits@cassandra.apache.org
Date: Wed, 1 Nov 2017 10:38:00 +0000 (UTC)
Subject: [jira] [Updated] (CASSANDRA-13849) GossipStage blocks because of race in ActiveRepairService

     [ https://issues.apache.org/jira/browse/CASSANDRA-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Lapukhov updated CASSANDRA-13849:
----------------------------------------
    Attachment: CAS-13849_3.patch

> GossipStage blocks because of race in ActiveRepairService
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-13849
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13849
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Tom van der Woerdt
>            Assignee: Sergey Lapukhov
>            Priority: Major
>              Labels: patch
>             Fix For: 3.0.x, 3.11.x
>
>         Attachments: CAS-13849.patch,
> CAS-13849_2.patch, CAS-13849_3.patch
>
>
> Bad luck caused a kernel panic in a cluster, and that took another node with it because GossipStage stopped responding.
> I think it's pretty obvious what's happening; here are the relevant excerpts from the stack traces:
> {noformat}
> "Thread-24004" #393781 daemon prio=5 os_prio=0 tid=0x00007efca9647400 nid=0xe75c waiting on condition [0x00007efaa47fe000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for <0x000000052b63a7e8> (a java.util.concurrent.CountDownLatch$Sync)
>        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>        at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:332)
>        - locked <0x00000002e6bc99f0> (a org.apache.cassandra.service.ActiveRepairService)
>        at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:211)
>        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>        at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$3/1498438472.run(Unknown Source)
>        at java.lang.Thread.run(Thread.java:748)
>
> "GossipTasks:1" #367 daemon prio=5 os_prio=0 tid=0x00007efc5e971000 nid=0x700b waiting for monitor entry [0x00007dfb839fe000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>        at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:421)
>        - waiting to lock <0x00000002e6bc99f0> (a org.apache.cassandra.service.ActiveRepairService)
>        at org.apache.cassandra.service.ActiveRepairService.convict(ActiveRepairService.java:776)
>        at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:306)
>        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:775)
>        at org.apache.cassandra.gms.Gossiper.access$800(Gossiper.java:67)
>        at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:187)
>        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118)
>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>        at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$3/1498438472.run(Unknown Source)
>        at java.lang.Thread.run(Thread.java:748)
>
> "GossipStage:1" #320 daemon prio=5 os_prio=0 tid=0x00007efc5b9f2c00 nid=0x6fcd waiting for monitor entry [0x00007e260186a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>        at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:421)
>        - waiting to lock <0x00000002e6bc99f0> (a org.apache.cassandra.service.ActiveRepairService)
>        at org.apache.cassandra.service.ActiveRepairService.convict(ActiveRepairService.java:776)
>        at org.apache.cassandra.service.ActiveRepairService.onRestart(ActiveRepairService.java:744)
>        at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1049)
>        at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1143)
>        at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
>        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
>        at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$3/1498438472.run(Unknown Source)
>        at java.lang.Thread.run(Thread.java:748)
> {noformat}
> In other words, org.apache.cassandra.service.ActiveRepairService.prepareForRepair holds a lock on the ActiveRepairService until the repair is prepared, which means waiting for responses from other nodes; if one of those nodes dies at exactly that moment, the prepare never completes. Gossip will at the same time try to mark the dead node as down, but convicting it requires that same lock, so GossipStage blocks :)
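
To make the failure mode concrete, here is a minimal, self-contained Java sketch of the pattern the traces show. This is not the actual ActiveRepairService code; the class name, method bodies, and timings are hypothetical stand-ins used only to illustrate why holding the service's monitor across a latch wait blocks the gossip path.

{noformat}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for ActiveRepairService, illustrating only the
// monitor-vs-latch interaction; it is not the real Cassandra implementation.
public class RepairServiceSketch
{
    private final CountDownLatch prepareAcks = new CountDownLatch(1);

    // Repair path: holds this object's monitor while waiting (potentially a
    // very long time) for remote nodes to acknowledge the prepare message.
    public synchronized void prepareForRepair() throws InterruptedException
    {
        // ... send PREPARE to replicas here ...
        prepareAcks.await(1, TimeUnit.HOURS); // parked with the monitor still held
    }

    // Gossip path: called when a peer is convicted as down. It needs the same
    // monitor, so it stays BLOCKED for as long as prepareForRepair is parked.
    public synchronized void removeParentRepairSession()
    {
        // ... clean up repair state for the dead peer here ...
    }

    public static void main(String[] args) throws Exception
    {
        RepairServiceSketch svc = new RepairServiceSketch();

        Thread repair = new Thread(() -> {
            try { svc.prepareForRepair(); } catch (InterruptedException ignored) {}
        }, "Thread-24004");

        Thread gossip = new Thread(svc::removeParentRepairSession, "GossipStage:1");

        repair.start();
        Thread.sleep(500);                 // let the repair thread take the monitor
        gossip.start();
        Thread.sleep(500);
        System.out.println(repair.getName() + " = " + repair.getState()); // TIMED_WAITING
        System.out.println(gossip.getName() + " = " + gossip.getState()); // BLOCKED
    }
}
{noformat}

Run as-is, this prints the same two states as the thread dump above: the repair thread is TIMED_WAITING on the latch while owning the monitor, and the "GossipStage:1" stand-in is BLOCKED waiting to lock the same object, so gossip processing stalls for as long as the prepare wait lasts, which is exactly the "GossipStage stopped responding" symptom described in the report.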