cassandra-user mailing list archives

From Jeff Ferland <...@tubularlabs.com>
Subject Cassandra stalls and dropped messages not due to GC
Date Fri, 30 Oct 2015 01:12:45 GMT
Using DSE 4.8.1 (Cassandra 2.1.11.872), Java version 1.8.0_66

We upgraded our cluster this weekend and have been having issues with dropped mutations since
then. Intensive investigation of a single node and experimenting with its settings has shown
that GC stalls don’t add up to enough time to explain the roughly 10 seconds of apparent stall
that would cause a hangup.

tpstats output typically shows active threads in the low single digits and pending counts
similarly low or at 0. Before a failure, pending MutationStage will skyrocket into 5+ digits.
System.log regularly shows the gossiper complaining, then slow log complaints, then dropped
mutations being logged.
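
For reference, a minimal sketch of the sort of polling I’m doing to watch pending MutationStage
(it assumes nodetool is on the PATH and that Pending is the third column of tpstats output, as
it is on 2.1):

#!/usr/bin/env python
# Poll `nodetool tpstats` and record the pending count for MutationStage.
# Assumes nodetool is on the PATH and Pending is the third column
# (Pool Name, Active, Pending, Completed, ...).
import subprocess
import time

while True:
    out = subprocess.check_output(["nodetool", "tpstats"]).decode()
    for line in out.splitlines():
        if line.startswith("MutationStage"):
            fields = line.split()
            print("%s pending=%s" % (time.strftime("%H:%M:%S"), fields[2]))
    time.sleep(5)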

For the entire minute of 00:55, the GC log shows no single pause > 0.14 seconds, and most of
them are much smaller. Abbreviated GC log after switching to G1GC (the problem was also
exhibited before G1GC):

2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC) (young)
2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:07.755+0000: 6760.062: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:10.532+0000: 6762.839: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:13.080+0000: 6765.387: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:15.914+0000: 6768.221: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:18.619+0000: 6770.926: [GC pause (GCLocker Initiated GC) (young)
2015-10-30T00:55:23.270+0000: 6775.578: [GC pause (GCLocker Initiated GC) (young)
2015-10-30T00:55:28.662+0000: 6780.969: [GC pause (GCLocker Initiated GC) (young)
2015-10-30T00:55:33.326+0000: 6785.633: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:36.600+0000: 6788.907: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:40.050+0000: 6792.357: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:43.728+0000: 6796.035: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:48.216+0000: 6800.523: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:53.621+0000: 6805.928: [GC pause (G1 Evacuation Pause) (young)
2015-10-30T00:55:59.048+0000: 6811.355: [GC pause (GCLocker Initiated GC) (young)
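
To sanity check that those pauses really don’t add up to anything near 10 seconds, I’m summing
the reported pause durations per minute out of the full log (the snippet above is abbreviated,
but the full pause lines end with the duration). A minimal sketch of that, assuming each pause
line ends with ", N.NNNNNNN secs]":

#!/usr/bin/env python
# Sum GC pause time per minute from a G1 GC log.
# Assumes each pause line ends with ", N.NNNNNNN secs]"; adjust the
# regex if your log flags produce a different format.
import re
import sys
from collections import defaultdict

pause_re = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}).*\[GC pause.*, (\d+\.\d+) secs\]")

totals = defaultdict(float)
for line in open(sys.argv[1]):
    m = pause_re.match(line)
    if m:
        totals[m.group(1)] += float(m.group(2))

for minute in sorted(totals):
    print("%s total GC pause: %.3f s" % (minute, totals[minute]))

That makes it easy to compare total pause time per minute against the ~10 second gaps.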

System log snippet of the pattern I’m seeing:

WARN  [GossipTasks:1] 2015-10-30 00:55:25,129  Gossiper.java:747 - Gossip stage has 1 pending tasks; skipping status check (no nodes will be marked down)
INFO  [CompactionExecutor:210] 2015-10-30 00:55:26,006  CompactionTask.java:141 - Compacting [SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')]
WARN  [GossipTasks:1] 2015-10-30 00:55:26,230  Gossiper.java:747 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
WARN  [GossipTasks:1] 2015-10-30 00:55:27,330  Gossiper.java:747 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
WARN  [GossipTasks:1] 2015-10-30 00:55:28,431  Gossiper.java:747 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
WARN  [GossipTasks:1] 2015-10-30 00:55:29,531  Gossiper.java:747 - Gossip stage has 10 pending tasks; skipping status check (no nodes will be marked down)
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448  CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451  CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453  CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454  CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455  CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,457  CqlSlowLogWriter.java:151 - Recording statements with duration of 16045 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,458  CqlSlowLogWriter.java:151 - Recording statements with duration of 16041 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,459  CqlSlowLogWriter.java:151 - Recording statements with duration of 16040 in slow log
INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,461  CqlSlowLogWriter.java:151 - Recording statements with duration of 16037 in slow log
WARN  [GossipTasks:1] 2015-10-30 00:55:32,697  FailureDetector.java:257 - Not marking nodes down due to local pause of 8668629909 > 5000000000
INFO  [CompactionExecutor:212] 2015-10-30 00:55:32,864  ColumnFamilyStore.java:900 - Enqueuing flush of compactions_in_progress: 164 (0%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:347 - Writing Memtable-compactions_in_progress@969357824(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:382 - Completed flushing /mnt/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-tmp-ka-1461769-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1446159752729, position=8699402)
INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,346  MessagingService.java:930 - MUTATION messages were dropped in last 5000 ms: 67432 for internal timeout and 0 for cross node timeout
INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,350  MessagingService.java:930 - READ_REPAIR messages were dropped in last 5000 ms: 209 for internal timeout and 0 for cross node timeout
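
That FailureDetector line seems like the key hint: Cassandra’s own check saw a local pause of
about 8.67 seconds (8668629909 ns against the 5000000000 ns threshold), which matches the
apparent stall but isn’t accounted for by the GC log. The check is essentially a periodic task
noticing that it woke up far later than scheduled. A toy version of the same idea (not
Cassandra’s code) that could run next to the JVM to tell machine-wide stalls apart from
JVM-only ones:

#!/usr/bin/env python
# Toy "local pause" detector: wake up on a fixed interval and report
# whenever the measured gap overshoots the interval badly.
# Same idea as Cassandra's local-pause check, not its implementation.
import time

INTERVAL = 1.0    # seconds between wakeups
THRESHOLD = 5.0   # report overshoots larger than this

last = time.time()
while True:
    time.sleep(INTERVAL)
    now = time.time()
    if now - last > INTERVAL + THRESHOLD:
        print("%s local pause of %.3f s detected"
              % (time.strftime("%H:%M:%S"), now - last))
    last = now

If a separate process like this stalls at the same moments, the problem would be machine-wide
(I/O, memory pressure, CPU starvation) rather than something inside the JVM.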

I’m starting to suspect I’ll have to take regular stack dumps and hope to catch a bad moment,
but I figured I’d reach out here before going down that road to see if anybody else has
experienced a similar kind of pain.
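
If it comes to that, the plan is roughly the loop below (a sketch; it assumes jstack from the
running JDK is on the PATH and that the Cassandra pid is passed as an argument):

#!/usr/bin/env python
# Take a thread dump every few seconds so there's a chance of catching
# the bad moment. Assumes jstack is on the PATH and the Cassandra pid
# is passed as the first argument.
import subprocess
import sys
import time

pid = sys.argv[1]
while True:
    stamp = time.strftime("%Y%m%dT%H%M%S")
    with open("jstack-%s.txt" % stamp, "w") as f:
        subprocess.call(["jstack", pid], stdout=f)
    time.sleep(5)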

-Jeff

