cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tyler Hobbs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12689) All MutationStage threads blocked, kills server
Date Tue, 04 Oct 2016 21:08:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546666#comment-15546666
] 

Tyler Hobbs commented on CASSANDRA-12689:
-----------------------------------------

bq. Thanks for the review. Of course this solution is error prone, but as I stated earlier
it's IMHO the only one that fixes it now with no risk.

Yes, I agree with that assessment.

bq. Either I can leave the test switches in, apply your feedback and commit that dtest or
I throw them away but then this situation is not testable any more with dtest. 

I believe test switches are acceptable when there's not a reasonable alternative for testing
something.  So, I would keep them here until we have a better alternative (such as this being
unit-testable).

bq. The negative test ends in write timeouts and shows tons of errors. I implemented it for
a proof but I would not recommend to commit it. So the TEST_FORCE_DEFERABLE_MUTATIONS can
also be thrown away, it's only required for the negative test.

Alright, let's remove that test and {{TEST_FORCE_DEFERABLE_MUTATIONS}}, then.

bq.  If I commit that dtest, should I create a new test file or append that test to an existing
one?

I would add the new test class to {{materialized_views_test.py}}.

bq. I already thought that PaxosState.commit maybe will allso need to be not deferrable but
was not sure if that also happens within the mutation stage.

I double checked this and it looks like it will get run on the mutation stage: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L266

Let me know if you have any other questions I can help with.

> All MutationStage threads blocked, kills server
> -----------------------------------------------
>
>                 Key: CASSANDRA-12689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12689
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Benjamin Roth
>            Assignee: Benjamin Roth
>            Priority: Critical
>             Fix For: 3.0.x, 3.x
>
>
> Under heavy load (e.g. due to repair during normal operations), a lot of NullPointerExceptions
occur in MutationStage. Unfortunately, the log is not very chatty, trace is missing:
> {noformat}
> 2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService
Uncaught exception on thread Thread[MutationStage-1,5,main]: {}
> 2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
> {noformat}
> Then, after some time, in most cases ALL threads in MutationStage pools are completely
blocked. This leads to piling up pending tasks until server runs OOM and is completely unresponsive
due to GC. Threads will NEVER unblock until server restart. Even if load goes completely down,
all hints are paused, and no compaction or repair is running. Only restart helps.
> I can understand that pending tasks in MutationStage may pile up under heavy load, but
tasks should be processed and dequeud after load goes down. This is definitively not the case.
This looks more like a an unhandled exception leading to a stuck lock.
> Stack trace from jconsole, all Threads in MutationStage show same trace.
> {noformat}
> Name: MutationStage-48
> State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
> Total blocked: 137  Total waited: 138.513
> {noformat}
> Stack trace: 
> {noformat}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
> org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
> org.apache.cassandra.hints.Hint.apply(Hint.java:96)
> org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
> java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message