cassandra-commits mailing list archives

From "Benjamin Roth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12689) All MutationStage threads blocked, kills server
Date Sat, 01 Oct 2016 06:04:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537972#comment-15537972 ]

Benjamin Roth commented on CASSANDRA-12689:
-------------------------------------------

Hi Tyler,

Thanks for the review. Of course this solution is error-prone, but as I stated earlier, it is
IMHO the only one that fixes the problem now without risk.

I recently had a conversation with @zznate, and he asked me to remove those "ugly test switches"
like TEST_FORCE_DEFERABLE_MUTATIONS.
I personally don't care; I am just a newbie to CS. Either I leave the test switches in,
apply your feedback and commit that dtest, or I throw them away, but then this situation is
no longer testable with dtest.
The next problem with dtest is that only the positive test works nicely. The negative test ends
in write timeouts and shows tons of errors. I implemented it as a proof, but I would not recommend
committing it. So TEST_FORCE_DEFERABLE_MUTATIONS can also be thrown away; it is only required
for the negative test.
If I commit that dtest, should I create a new test file or append the test to an existing
one?
So, @thobbs + @zznate, please tell me how to move on.

I already suspected that PaxosState.commit may also need to be non-deferrable, but I was
not sure whether that also happens within the mutation stage. Unfortunately, I am missing the bigger
picture of the CS code at the moment.

> All MutationStage threads blocked, kills server
> -----------------------------------------------
>
>                 Key: CASSANDRA-12689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12689
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Assignee: Benjamin Roth
>            Priority: Critical
>             Fix For: 3.0.x, 3.x
>
>
> Under heavy load (e.g. due to repair during normal operations), a lot of NullPointerExceptions
> occur in MutationStage. Unfortunately, the log is not very chatty and the stack trace is missing:
> {noformat}
> 2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught exception on thread Thread[MutationStage-1,5,main]: {}
> 2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
> {noformat}
> Then, after some time, in most cases ALL threads in MutationStage pools are completely
> blocked. Pending tasks then pile up until the server runs OOM and becomes completely unresponsive
> due to GC. The threads NEVER unblock until the server is restarted, even if load drops completely,
> all hints are paused, and no compaction or repair is running. Only a restart helps.
> I can understand that pending tasks in MutationStage may pile up under heavy load, but
> tasks should be processed and dequeued once load goes down. This is definitely not the case.
> It looks more like an unhandled exception leading to a stuck lock.
> Stack trace from jconsole; all threads in MutationStage show the same trace.
> {noformat}
> Name: MutationStage-48
> State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
> Total blocked: 137  Total waited: 138.513
> {noformat}
> Stack trace: 
> {noformat}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
> org.apache.cassandra.hints.Hint.apply(Hint.java:96)
> org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
> java.lang.Thread.run(Thread.java:745)
> {noformat}
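
In the stack trace above, every worker is parked in CompletableFuture.get inside Mutation.apply. One way such a state can arise, assuming the awaited future can only be completed by another task queued on the same bounded pool, is pool self-starvation. The following is a minimal, self-contained Java sketch of that general pattern. It is not Cassandra code; the class name PoolStarvationSketch, the method outerCompletes, and the pool sizes and timeouts are hypothetical and chosen purely for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only: a worker blocks on a future whose
// completing task is queued behind it on the same fixed-size pool.
public class PoolStarvationSketch {

    // Returns true if the outer task finished within the timeout,
    // false if it was still blocked when the timeout expired.
    static boolean outerCompletes(int poolSize, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Callable<Void> outerTask = () -> {
                // Defer part of the work to the SAME pool ...
                CompletableFuture<Void> inner = CompletableFuture.runAsync(() -> { }, pool);
                // ... and block this worker until the deferred part is done.
                inner.get();
                return null;
            };
            Future<Void> outer = pool.submit(outerTask);
            try {
                outer.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            }
        } finally {
            // Interrupts the blocked worker so the JVM can exit.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // One worker: the inner task can never run, so the outer task stays blocked.
        System.out.println("pool=1 completed=" + outerCompletes(1, 500));   // false
        // Two workers: a free thread runs the inner task, so the outer task completes.
        System.out.println("pool=2 completed=" + outerCompletes(2, 5000));  // true
    }
}
```

With a single worker the outer task never completes, because the only thread that could run the deferred task is the one waiting for it. Under the stated assumption, the same thing happens on a larger pool once every worker is waiting at the same time, which would match a pool that stays blocked even after load drops.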



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
