cassandra-user mailing list archives

From Graham Sanderson <gra...@vast.com>
Subject Re: Cassandra stalls and dropped messages not due to GC
Date Fri, 30 Oct 2015 03:35:54 GMT
Only if you actually change cassandra.yaml (that was the change in 2.1.6, which is why it matters what version he upgraded from)

> On Oct 29, 2015, at 10:06 PM, Sebastian Estevez <sebastian.estevez@datastax.com> wrote:
> 
> The thing about the CASSANDRA-9504 theory is that it was solved in 2.1.6 and Jeff's running 2.1.11.
> 
> @Jeff
> 
> How often does this happen? Can you watch ttop as soon as you notice increased read/write latencies?
> 
> wget https://bintray.com/artifact/download/aragozin/generic/sjk-plus-0.3.6.jar
> java -jar sjk-plus-0.3.6.jar ttop -s localhost:7199 -n 30 -o CPU
> 
> This should at least tell you which Cassandra threads are causing high memory allocations and CPU consumption.
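> 
> If the moment is hard to catch live, one option is to leave a sampler running and read the window back afterwards. A minimal sketch (the output file and the 60-second chunking via coreutils timeout are my choices, not part of sjk):
> 
>     while true; do
>       date >> ttop.log
>       timeout 60 java -jar sjk-plus-0.3.6.jar ttop -s localhost:7199 -n 30 -o CPU >> ttop.log
>     done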
> 
> All the best,
> 
> Sebastián Estévez
> Solutions Architect | 954 905 8615 | sebastian.estevez@datastax.com
> 
> On Thu, Oct 29, 2015 at 9:36 PM, Graham Sanderson <graham@vast.com> wrote:
> you didn’t say what you upgraded from, but if it is 2.0.x, then look at CASSANDRA-9504
> 
> If so and you use
> commitlog_sync: batch
> Then you probably want to set
> commitlog_sync_batch_window_in_ms: 1 (or 2)
> Note I’m only slightly convinced this is the cause, because of your READ_REPAIR issues (though if you are dropping a lot of MUTATIONs under load and your machines are overloaded, you’d probably be doing more READ_REPAIR than usual)
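> 
> To double-check what a node is actually running with after the upgrade, something like this works (the config path is an assumption — package installs often use /etc/cassandra/cassandra.yaml, DSE packages /etc/dse/cassandra/cassandra.yaml):
> 
>     grep -E '^commitlog_sync' /etc/cassandra/cassandra.yaml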
> 
>> On Oct 29, 2015, at 8:12 PM, Jeff Ferland <jbf@tubularlabs.com> wrote:
>> 
>> Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66
>> 
>> We upgraded our cluster this weekend and have been having issues with dropped mutations since then. Intensely investigating a single node and toying with settings has revealed that GC stalls don’t make up enough time to explain the 10 seconds of apparent stall that would cause a hangup.
>> 
>> tpstats output typically shows active threads in the low single digits and pending similar or 0. Before a failure, pending MutationStage will skyrocket into 5+ digits. System.log regularly shows the gossiper complaining, then slow log complaints, then dropped mutations.
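>> 
>> A timestamped tpstats loop is a cheap way to catch that spike as it builds; a rough sketch (the 5-second interval and file name are arbitrary):
>> 
>>     while true; do echo "== $(date)"; nodetool tpstats; sleep 5; done >> tpstats.log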
>> 
>> For the entire minute of 00:55, the GC logging shows no single pause > 0.14 seconds, and most of them are much smaller. Abbreviated GC log after switching to G1GC (the problem was also exhibited before G1GC):
>> 
>> 2015-10-30T00:55:00.550+0000: 6752.857: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:02.843+0000: 6755.150: [GC pause (GCLocker Initiated GC) (young)
>> 2015-10-30T00:55:05.241+0000: 6757.548: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:07.755+0000: 6760.062: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:10.532+0000: 6762.839: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:13.080+0000: 6765.387: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:15.914+0000: 6768.221: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:18.619+0000: 6770.926: [GC pause (GCLocker Initiated GC) (young)
>> 2015-10-30T00:55:23.270+0000: 6775.578: [GC pause (GCLocker Initiated GC) (young)
>> 2015-10-30T00:55:28.662+0000: 6780.969: [GC pause (GCLocker Initiated GC) (young)
>> 2015-10-30T00:55:33.326+0000: 6785.633: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:36.600+0000: 6788.907: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:40.050+0000: 6792.357: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:43.728+0000: 6796.035: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:48.216+0000: 6800.523: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:53.621+0000: 6805.928: [GC pause (G1 Evacuation Pause) (young)
>> 2015-10-30T00:55:59.048+0000: 6811.355: [GC pause (GCLocker Initiated GC) (young)
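>> 
>> To make sure no pause is hiding behind the header lines above, the real pause times can be pulled straight out of the full log; a sketch, assuming JDK 8 -XX:+PrintGCDetails output where each collection ends with a "[Times: user=... sys=... real=N.NN secs]" line:
>> 
>>     grep -oP 'real=\K[0-9.]+' gc.log | sort -rn | head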
>> 
>> System log snippet of the pattern I’m seeing:
>> 
>> WARN  [GossipTasks:1] 2015-10-30 00:55:25,129  Gossiper.java:747 - Gossip stage has 1 pending tasks; skipping status check (no nodes will be marked down)
>> INFO  [CompactionExecutor:210] 2015-10-30 00:55:26,006  CompactionTask.java:141 - Compacting [SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'), SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')]
>> WARN  [GossipTasks:1] 2015-10-30 00:55:26,230  Gossiper.java:747 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
>> WARN  [GossipTasks:1] 2015-10-30 00:55:27,330  Gossiper.java:747 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
>> WARN  [GossipTasks:1] 2015-10-30 00:55:28,431  Gossiper.java:747 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
>> WARN  [GossipTasks:1] 2015-10-30 00:55:29,531  Gossiper.java:747 - Gossip stage has 10 pending tasks; skipping status check (no nodes will be marked down)
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448  CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451  CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453  CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454  CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455  CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,457  CqlSlowLogWriter.java:151 - Recording statements with duration of 16045 in slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,458  CqlSlowLogWriter.java:151 - Recording statements with duration of 16041 in slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,459  CqlSlowLogWriter.java:151 - Recording statements with duration of 16040 in slow log
>> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,461  CqlSlowLogWriter.java:151 - Recording statements with duration of 16037 in slow log
>> WARN  [GossipTasks:1] 2015-10-30 00:55:32,697  FailureDetector.java:257 - Not marking nodes down due to local pause of 8668629909 > 5000000000
>> INFO  [CompactionExecutor:212] 2015-10-30 00:55:32,864  ColumnFamilyStore.java:900 - Enqueuing flush of compactions_in_progress: 164 (0%) on-heap, 0 (0%) off-heap
>> INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:347 - Writing Memtable-compactions_in_progress@969357824(0.008KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
>> INFO  [MemtableFlushWriter:846] 2015-10-30 00:55:32,865  Memtable.java:382 - Completed flushing /mnt/cassandra/data/system/compactions_in_progress/system-compactions_in_progress-tmp-ka-1461769-Data.db (0.000KiB) for commitlog position ReplayPosition(segmentId=1446159752729, position=8699402)
>> INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,346  MessagingService.java:930 - MUTATION messages were dropped in last 5000 ms: 67432 for internal timeout and 0 for cross node timeout
>> INFO  [ScheduledTasks:1] 2015-10-30 00:55:35,350  MessagingService.java:930 - READ_REPAIR messages were dropped in last 5000 ms: 209 for internal timeout and 0 for cross node timeout
>> 
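>> A rough way to measure how often these local pauses recur (the log path is a guess for a package install):
>> 
>>     grep -c 'local pause' /var/log/cassandra/system.log
>> 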
>> I’m starting to suspect I’ll have to take regular stack dumps and catch a bad moment, but I figured I’d reach out here before going down that road, to see if somebody else had ever experienced a similar kind of pain.
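>> 
>> If it comes to that, a rough sketch of the periodic-dump approach (the pgrep pattern and 2-second interval are assumptions):
>> 
>>     PID=$(pgrep -f CassandraDaemon)
>>     while true; do jstack -l "$PID" > "jstack.$(date +%s).txt"; sleep 2; done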
>> 
>> -Jeff
>> 
> 
> 

