cassandra-user mailing list archives

From Jeff Ferland <...@tubularlabs.com>
Subject Re: Cassandra stalls and dropped messages not due to GC
Date Mon, 02 Nov 2015 18:55:27 GMT
Having caught a node in an undesirable state, I see many of its threads looking like this:
"SharedPool-Worker-5" #875 daemon prio=5 os_prio=0 tid=0x00007f3e14196800 nid=0x96ce waiting
on condition [0x00007f3ddb835000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
        at org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283)
        at org.apache.cassandra.db.commitlog.PeriodicCommitLogService.maybeWaitForSync(PeriodicCommitLogService.java:44)
        at org.apache.cassandra.db.commitlog.AbstractCommitLogService.finishWriteFor(AbstractCommitLogService.java:152)
        at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:252)
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:379)
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359)
        at org.apache.cassandra.db.Mutation.apply(Mutation.java:214)
        at org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:54)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
        at java.lang.Thread.run(Thread.java:745)
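If I'm reading the periodic-mode code right, maybeWaitForSync only blocks a writer when the background fsync has fallen well behind commitlog_sync_period_in_ms (something like 1.5x the period), so a pile-up of workers parked here suggests the sync thread itself is stalling on disk rather than the node being overwhelmed by raw write volume. A rough way to count how many workers are stuck at that point (pid is a placeholder):

$ jstack <cassandra-pid> | grep -c 'PeriodicCommitLogService.maybeWaitForSync'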

But the commit log write load seems evenly spaced and low enough in volume:
/mnt/cassandra/commitlog$ ls -lht | head
total 7.2G
-rw-r--r-- 1 cassandra cassandra 32M Nov  2 18:50 CommitLog-4-1446162051324.log
-rw-r--r-- 1 cassandra cassandra 32M Nov  2 18:50 CommitLog-4-1446162051323.log
-rw-r--r-- 1 cassandra cassandra 32M Nov  2 18:50 CommitLog-4-1446162051322.log
-rw-r--r-- 1 cassandra cassandra 32M Nov  2 18:49 CommitLog-4-1446162051321.log
-rw-r--r-- 1 cassandra cassandra 32M Nov  2 18:49 CommitLog-4-1446162051320.log
-rw-r--r-- 1 cassandra cassandra 32M Nov  2 18:48 CommitLog-4-1446162051319.log
-rw-r--r-- 1 cassandra cassandra 32M Nov  2 18:48 CommitLog-4-1446162051318.log
-rw-r--r-- 1 cassandra cassandra 32M Nov  2 18:47 CommitLog-4-1446162051317.log
-rw-r--r-- 1 cassandra cassandra 32M Nov  2 18:46 CommitLog-4-1446162051316.log

Commit logs are on the 10-second periodic setting:
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
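Given that, one way to catch commit log disk stalls too brief for my normal graphing granularity would be watching the device at one-second resolution, along the lines of:

$ iostat -x 1

and keeping an eye on await and %util for the commit log volume.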

SSDs are fully trimmed and mounted with discard, since it occurred to me that this could be an issue. Still stuck diagnosing this.
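For completeness, verifying the trim/discard state is roughly the following (assuming /mnt/cassandra is the mount, as in the listing above):

$ lsblk --discard
$ findmnt -no OPTIONS /mnt/cassandra
$ sudo fstrim -v /mnt/cassandra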

> On Oct 30, 2015, at 3:37 PM, Nate McCall <nate@thelastpickle.com> wrote:
> 
> Does tpstats show unusually high counts for blocked flush writers? 

The “All Time Blocked” metric is 0 across my entire cluster.
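(Checked with something along the lines of:

$ nodetool tpstats | egrep -i 'blocked|flush'

on each node, looking at the flush writer pool's "All time blocked" column.)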

> As Sebastian suggests, running ttop will paint a clearer picture about what is happening
> within C*. I would however recommend going back to CMS in this case as that is the devil we
> all know and more folks will be able to offer advice on seeing its output (and it removes
> a delta).

Forgive me, but what is CMS?

> 
>> It’s starting to look to me like it’s possibly related to brief IO spikes that are
>> smaller than my usual graphing granularity. It feels surprising to me that these would affect
>> the Gossip threads, but it’s the best current lead I have with my debugging right now. More
>> to come when I learn it.
> 
> Probably not the case since this was a result of an upgrade, but I've seen similar behavior
> on systems where some kernels had issues with irqbalance doing the right thing and would end
> up parking most interrupts on CPU0 (like say for the disk and ethernet modules) regardless
> of the number of cores. Check out proc via 'cat /proc/interrupts' and make sure the interrupts
> are spread out over the CPU cores. You can steer them off manually at runtime if they are not
> spread out.

Interrupt loading is even.
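(Checked with something like:

$ watch -d -n1 cat /proc/interrupts

and if they ever do pile up on CPU0, a specific IRQ can be steered by writing a CPU mask to /proc/irq/<N>/smp_affinity.)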

> Also, did you upgrade anything besides Cassandra?

No. I’ve since tried some mitigations (tuning thread pool sizes and GC), but the problem began
with nothing more than an upgrade of Cassandra. No other system packages, kernels, etc.

-Jeff


