X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io
Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io
Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id C9DE1200BD2; Fri, 28 Oct 2016 22:49:01 +0200 (CEST)
Received: by cust-asf.ponee.io (Postfix) id C8BCE160AF5; Fri, 28 Oct 2016 20:49:01 +0000 (UTC)
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 1EA43160B07; Fri, 28 Oct 2016 22:49:00 +0200 (CEST)
Received: (qmail 31098 invoked by uid 500); 28 Oct 2016 20:48:59 -0000
Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Delivered-To: mailing list commits@cassandra.apache.org
Received: (qmail 30566 invoked by uid 99); 28 Oct 2016 20:48:59 -0000
Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 28 Oct 2016 20:48:59 +0000
Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id A1BE02C2A69; Fri, 28 Oct 2016 20:48:59 +0000 (UTC)
Date: Fri, 28 Oct 2016 20:48:59 +0000 (UTC)
From: "Tyler Hobbs (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-12689) All MutationStage threads blocked, kills server
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
archived-at: Fri, 28 Oct 2016 20:49:02 -0000

    [ https://issues.apache.org/jira/browse/CASSANDRA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616518#comment-15616518 ]

Tyler Hobbs commented on CASSANDRA-12689:
-----------------------------------------

The new test results look good, so +1, committed to
3.0 as {{d38a732ce15caab57ce6dddb3c0d6a436506db29}} and merged up to 3.X and trunk. Thanks!

> All MutationStage threads blocked, kills server
> -----------------------------------------------
>
>                 Key: CASSANDRA-12689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12689
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Assignee: Benjamin Roth
>            Priority: Critical
>             Fix For: 3.0.10, 3.10
>
>
> Under heavy load (e.g. due to repair during normal operations), a lot of NullPointerExceptions occur in MutationStage. Unfortunately, the log is not very chatty and the stack trace is missing:
> {noformat}
> 2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught exception on thread Thread[MutationStage-1,5,main]: {}
> 2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
> {noformat}
> Then, after some time, in most cases ALL threads in the MutationStage pool are completely blocked. Pending tasks then pile up until the server runs out of memory and becomes completely unresponsive due to GC. The threads NEVER unblock until the server is restarted, even if the load drops completely, all hints are paused, and no compaction or repair is running. Only a restart helps.
> I can understand that pending tasks in MutationStage may pile up under heavy load, but tasks should be processed and dequeued after the load goes down. This is definitely not the case. This looks more like an unhandled exception leading to a stuck lock.
> Stack trace from jconsole; all threads in MutationStage show the same trace.
> {noformat}
> Name: MutationStage-48
> State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
> Total blocked: 137  Total waited: 138.513
> {noformat}
> Stack trace:
> {noformat}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
> org.apache.cassandra.hints.Hint.apply(Hint.java:96)
> org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
> java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
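
Illustration of the failure mode the reporter suspects (an unhandled exception leaving waiters parked in CompletableFuture.get). This is only a minimal sketch under stated assumptions: it is not Cassandra code and not the fix committed for this issue. The class StuckFutureDemo and all of its methods are hypothetical names invented for the example. It shows that a task which throws before completing its future leaves any waiter blocked (here bounded by a timeout so the demo terminates), while forwarding the exception with completeExceptionally wakes the waiter.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; none of these names come from Cassandra.
class StuckFutureDemo {

    // Stand-in for work that fails with an NPE before reporting a result.
    static void failingWork() {
        String value = null;
        value.length(); // throws NullPointerException
    }

    // The task throws before complete() is called, so the future never finishes.
    static CompletableFuture<Void> submitLeaky(ExecutorService pool) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        pool.execute(() -> {
            failingWork();
            future.complete(null); // never reached
        });
        return future;
    }

    // Same task, but the failure is forwarded to the future so waiters wake up.
    static CompletableFuture<Void> submitSafe(ExecutorService pool) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        pool.execute(() -> {
            try {
                failingWork();
                future.complete(null);
            } catch (Throwable t) {
                future.completeExceptionally(t);
            }
        });
        return future;
    }

    // Waits at most 500 ms; an untimed get() on the leaky future would block forever.
    static String waitFor(CompletableFuture<Void> future) {
        try {
            future.get(500, TimeUnit.MILLISECONDS);
            return "ok";
        } catch (TimeoutException e) {
            return "timeout";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            System.out.println(waitFor(submitLeaky(pool))); // waiter is never signalled
            System.out.println(waitFor(submitSafe(pool)));  // waiter sees the failure
        } finally {
            pool.shutdown();
        }
    }
}
```

In the leaky variant the NullPointerException surfaces only through the thread's uncaught-exception handler, which would be consistent with the "Uncaught exception on thread" log lines quoted above, although whether that is the actual cause here is an assumption.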