cassandra-commits mailing list archives

From "Benjamin Roth (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-12689) All MutationStage threads blocked, kills server
Date Fri, 23 Sep 2016 18:16:20 GMT


Benjamin Roth commented on CASSANDRA-12689:

Adding the return statements fixed all the NPEs in the log. Unfortunately, the MutationStage
problem occurred once again.
Will observe it. I turned on the newly implemented RateBasedBackPressure, will see if it changes

> All MutationStage threads blocked, kills server
> -----------------------------------------------
>                 Key: CASSANDRA-12689
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Priority: Critical
> Under heavy load (e.g. due to repair during normal operations), a lot of NullPointerExceptions
occur in MutationStage. Unfortunately, the log is not very chatty and the stack trace is missing:
> 2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService
Uncaught exception on thread Thread[MutationStage-1,5,main]: {}
> 2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
> Then, after some time, in most cases ALL threads in the MutationStage pool are completely
blocked. Pending tasks then pile up until the server runs OOM and becomes completely unresponsive
due to GC. The threads NEVER unblock until the server is restarted, even if load goes down completely,
all hints are paused, and no compaction or repair is running. Only a restart helps.
> I can understand that pending tasks in MutationStage may pile up under heavy load, but
tasks should be processed and dequeued after load goes down. This is definitely not the case.
This looks more like an unhandled exception leading to a stuck lock.
> Stack trace from jconsole; all threads in MutationStage show the same trace.
> Name: MutationStage-48
> State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
> Total blocked: 137  Total waited: 138.513
> Stack trace: 
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(
> java.util.concurrent.CompletableFuture$Signaller.block(
> java.util.concurrent.ForkJoinPool.managedBlock(
> java.util.concurrent.CompletableFuture.waitingGet(
> java.util.concurrent.CompletableFuture.get(
> org.apache.cassandra.db.Mutation.apply(
> org.apache.cassandra.db.Mutation.apply(
> org.apache.cassandra.hints.Hint.apply(
> org.apache.cassandra.hints.HintVerbHandler.doVerb(
> java.util.concurrent.Executors$
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$
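
The trace above shows every MutationStage thread parked in CompletableFuture.get(). The sketch below is a standalone illustration of how that state can arise in general. It is not Cassandra code, and it does not confirm that this is the cause in CASSANDRA-12689. It assumes a producer that hits an exception and returns without completing the future a worker later waits on. The class StuckFutureSketch and the methods failingWrite and runDemo are hypothetical names. The sketch uses get() with a timeout only so that it terminates; the trace shows the untimed get(), which would wait forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

final class StuckFutureSketch {

    // Hypothetical stand-in for a write step that fails with an NPE.
    static void failingWrite() {
        throw new NullPointerException("simulated failure");
    }

    // Returns how many workers were still waiting when their timeout fired.
    static int runDemo(int poolSize, boolean completeOnFailure) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger stillWaiting = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(poolSize);

        for (int i = 0; i < poolSize; i++) {
            CompletableFuture<Void> result = new CompletableFuture<>();
            try {
                failingWrite();
                result.complete(null);
            } catch (RuntimeException e) {
                if (completeOnFailure) {
                    // Propagate the failure so waiters are released.
                    result.completeExceptionally(e);
                }
                // Otherwise the exception is swallowed and the future never completes.
            }

            pool.submit(() -> {
                try {
                    // An untimed get() would park this thread indefinitely.
                    result.get(200, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    stillWaiting.incrementAndGet();
                } catch (Exception e) {
                    // ExecutionException: the failure arrived and the thread is free again.
                } finally {
                    done.countDown();
                }
            });
        }

        done.await();
        pool.shutdown();
        return stillWaiting.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("waiting, future never completed: " + runDemo(4, false));
        System.out.println("waiting, completed exceptionally: " + runDemo(4, true));
    }
}
```

With a pool of four threads, the first call reports all four workers still waiting and the second reports none. In this sketch, completing the future exceptionally on every failure path is what releases the waiting threads.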
