cassandra-commits mailing list archives

From "Stefan Podkowinski (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-12966) Gossip thread slows down when using batch commit log
Date Wed, 30 Nov 2016 16:39:58 GMT


Stefan Podkowinski commented on CASSANDRA-12966:

Seems like the single-threaded execution of gossip is a bit problematic, as this also caused some
pain in CASSANDRA-12281. Looks like CASSANDRA-8398 will be a good thing to have here.

Some comments regarding your patch:

My thoughts on concurrency aspects:
StorageService.handleStateNormal will update tokens for both TokenMetadata and SystemKeyspace.
The previous blocking behavior would ensure both stay in sync. Offloading the system table
update to the mutation stage would allow the table to lag behind, but I would not expect
any races between mutations, as the execution order hasn't changed, just the executor.
Decoupling the mutations this way without waiting for the write result shouldn't be a problem,
as the system table is only used during initialization and there are no guarantees that the
gossip state for a node is always recent anyway.
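
To illustrate the point above, here is a minimal sketch, not the actual patch. The class
TokenUpdateSketch, its fields, and the single-threaded executor standing in for the mutation
stage are all assumptions made for this example. The in-memory view is updated on the calling
thread, the persisted view is updated later by the executor and may lag, and submission order
is preserved because one worker drains the queue.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for illustration only; none of these names are Cassandra APIs.
class TokenUpdateSketch {
    // In-memory view (think TokenMetadata): updated synchronously by the caller.
    final List<String> inMemoryTokens = new CopyOnWriteArrayList<>();
    // Persisted view (think system table): updated later, so it may lag behind.
    final List<String> persistedTokens = new CopyOnWriteArrayList<>();
    // Assumption: a single worker thread, so tasks run in submission order.
    final ExecutorService writeStage = Executors.newSingleThreadExecutor();

    // Updates memory right away and hands the "disk" write to the executor.
    // The caller does not wait for the returned Future.
    Future<?> handleStateNormal(String token) {
        inMemoryTokens.add(token);
        return writeStage.submit(() -> { persistedTokens.add(token); });
    }

    // Stops accepting work and waits for queued writes to finish.
    void shutdown() throws InterruptedException {
        writeStage.shutdown();
        writeStage.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        TokenUpdateSketch sketch = new TokenUpdateSketch();
        sketch.handleStateNormal("token-1");
        sketch.handleStateNormal("token-2");
        sketch.shutdown();
        // After the queue has drained, both views hold the same tokens in the same order.
        System.out.println(sketch.inMemoryTokens.equals(sketch.persistedTokens));
    }
}
```

Between a call to handleStateNormal and the completion of its queued write, the persisted view
is allowed to be stale, which is the lag discussed above.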

The synchronized keyword on removeEndpoints looks like a leftover from when the code would
read and write back the modified token set, and it should be safe to remove it.

As for API modifications, there are now two updateToken versions, one blocking and one asynchronous.
Maybe the async method should be named differently: the Future return value is not checked
in the code, so you can't tell which version is being called by reading the code on the
caller side.
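
A small sketch of the naming idea, under assumptions: PeerTokenStore and updateTokenAsync are
hypothetical names chosen for this example, and the signatures are not taken from the patch.
The point is only that the asynchronous variant is recognizable at the call site even when the
returned Future is ignored.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for illustration only; not the SystemKeyspace API.
class PeerTokenStore {
    private final Map<String, String> tokensByPeer = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Blocking variant: the write has been applied when this method returns.
    void updateToken(String peer, String token) {
        tokensByPeer.put(peer, token);
    }

    // Asynchronous variant: the distinct name tells the reader of the caller
    // that the write may not have happened yet when this method returns.
    Future<?> updateTokenAsync(String peer, String token) {
        return executor.submit(() -> updateToken(peer, token));
    }

    String tokenFor(String peer) {
        return tokensByPeer.get(peer);
    }

    void shutdown() {
        executor.shutdown();
    }

    public static void main(String[] args) throws Exception {
        PeerTokenStore store = new PeerTokenStore();
        store.updateToken("peer-a", "100");        // visibly blocking
        store.updateTokenAsync("peer-b", "200")    // visibly asynchronous
             .get(5, TimeUnit.SECONDS);            // waiting here only for the demo
        System.out.println(store.tokenFor("peer-a") + " " + store.tokenFor("peer-b"));
        store.shutdown();
    }
}
```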

> Gossip thread slows down when using batch commit log
> ----------------------------------------------------
>                 Key: CASSANDRA-12966
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
> When using batch commit log mode, the Gossip thread slows down on peers after a node
bounces. This is because we perform a bunch of updates to the peers table via {{SystemKeyspace.updatePeerInfo}},
which is a synchronized method. How long each of those individual updates takes depends
on how busy the system is at the time with respect to write traffic. If the system is largely quiescent,
each update will be relatively quick (just waiting for the fsync). If the system is getting
a lot of writes, and depending on the commitlog_sync_batch_window_in_ms, each of the Gossip
thread's updates can get stuck in the backlog, which causes the Gossip thread to stop processing.
We have observed in large clusters that a rolling restart triggers and exacerbates
this behavior.
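
The stall described in the quoted issue can be sketched as follows. This is an illustration
under assumptions, not Cassandra code: GossipStallSketch is a hypothetical class, a
CountDownLatch stands in for the commit log sync, and a single-threaded executor stands in for
the Gossip thread. While the synchronized update waits for the sync, nothing else queued on
that thread can run.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for illustration only.
class GossipStallSketch {
    // Stand-in for the batch commit log sync that the write has to wait for.
    private final CountDownLatch commitLogSynced = new CountDownLatch(1);
    // Stand-in for the single Gossip thread.
    final ExecutorService gossipStage = Executors.newSingleThreadExecutor();

    // Blocks the calling thread until the simulated sync completes.
    synchronized void updatePeerInfo() throws InterruptedException {
        commitLogSynced.await();
    }

    // Simulates the commit log finishing its sync.
    void completeSync() {
        commitLogSynced.countDown();
    }

    // Queues a peer update followed by further gossip work on the same thread.
    Future<String> submitPeerUpdateThenGossip() {
        gossipStage.submit(() -> { updatePeerInfo(); return null; });
        return gossipStage.submit(() -> "gossip round processed");
    }

    public static void main(String[] args) throws Exception {
        GossipStallSketch sketch = new GossipStallSketch();
        Future<String> nextRound = sketch.submitPeerUpdateThenGossip();
        try {
            nextRound.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("gossip work is stuck behind the peer update");
        }
        sketch.completeSync();
        System.out.println(nextRound.get(5, TimeUnit.SECONDS));
        sketch.gossipStage.shutdown();
    }
}
```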

