cassandra-commits mailing list archives

From "Matt Byrd (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-7533) Let MAX_OUTSTANDING_REPLAY_COUNT be configurable
Date Tue, 15 Jul 2014 01:12:04 GMT


Matt Byrd commented on CASSANDRA-7533:

Just to add a bit more context: we had a single instance of Cassandra get fairly stuck replaying
the commit log. It was burning through 2000%+ CPU for over four hours with no end in sight, so we
killed it, removed the commit logs, brought it back up and ran repair. (This was in QA, thankfully.)

The problem can easily be reproduced by just writing 100,000 CQL row deletes (range deletes) to the
same partition key, stopping Cassandra and starting it again.
I admit this is somewhat of an anti-pattern, but it is still quite a dramatic effect from not very
much data.
The problem exercised here is that:
1. We contend in the memtable to do this insert in a CAS loop.
2. The work done in this loop becomes ever more expensive, because RangeTombstoneList.dataSize
iterates over the list to compute the size.
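
As a rough illustration of those two points (a minimal sketch, not Cassandra's actual memtable
code; the class CasLoopSketch and all of its members are hypothetical names), here is a CAS loop
in which every attempt redoes an O(n) size computation. Each lost CAS throws that work away and
repeats it, so the cost per insert grows with the list and with the number of contending threads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for one partition in a memtable; not Cassandra's classes.
final class CasLoopSketch {
    // Immutable snapshot of the tombstone sizes accumulated so far.
    private final AtomicReference<List<Integer>> current =
            new AtomicReference<>(Collections.<Integer>emptyList());

    // Counts list elements visited by dataSize, to make the growing cost visible.
    final AtomicLong elementsVisited = new AtomicLong();

    // O(n) size computation, redone on every attempt (the expensive part).
    private long dataSize(List<Integer> tombstones) {
        long total = 0;
        for (int size : tombstones) {
            total += size;
            elementsVisited.incrementAndGet();
        }
        return total;
    }

    // Adds one tombstone and returns the size delta; retries when the CAS is lost.
    long add(int tombstoneSize) {
        while (true) {
            List<Integer> before = current.get();
            List<Integer> after = new ArrayList<>(before);
            after.add(tombstoneSize);
            long sizeDelta = dataSize(after) - dataSize(before);
            if (current.compareAndSet(before, Collections.unmodifiableList(after))) {
                return sizeDelta;
            }
            // CAS lost: another thread won, so the work above is discarded and repeated.
        }
    }

    int count() {
        return current.get().size();
    }
}
```

Even single-threaded, the number of elements visited grows quadratically with the number of
inserts; contention multiplies that by the number of retries.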

Point 2 is effectively fixed in 2.1: with all the off-heap allocation work, the dataSize
calculation effectively becomes more online.
To resolve this problem in 2.0 you could also keep this tally of dataSize online, or maybe
start keeping it online once the list is sufficiently big to cause a problem.
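
To make "keeping the tally online" concrete, here is a minimal sketch (the class OnlineSizeTally
and its methods are hypothetical, not the 2.0 or 2.1 implementation): the running total is updated
on each add, so reading the size no longer needs to walk the list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of an incrementally maintained size; not Cassandra's code.
final class OnlineSizeTally {
    private final List<Integer> sizes = new ArrayList<>();
    private long dataSize; // running total, updated on every add

    void add(int tombstoneSize) {
        sizes.add(tombstoneSize);
        dataSize += tombstoneSize; // O(1) bookkeeping instead of a later O(n) scan
    }

    // O(1): returns the tally kept online.
    long dataSize() {
        return dataSize;
    }

    // O(n): the scan-based computation, kept only for comparison.
    long recomputedDataSize() {
        long total = 0;
        for (int size : sizes) {
            total += size;
        }
        return total;
    }
}
```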
Doing this seemed to help a lot, but far simpler was just reducing the concurrency of the
commit log replay, which can be achieved by lowering MAX_OUTSTANDING_REPLAY_COUNT (in our case
setting it to 1 seemed to help).
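
For anyone unfamiliar with how a cap on outstanding work limits concurrency, here is a generic
sketch (not the actual replay code; BoundedReplaySketch, replay and the maxOutstanding parameter
are hypothetical names, and the exact comparison used in Cassandra may differ). Tasks are
submitted to a pool, and once maxOutstanding futures are pending the submitter waits for all of
them before continuing. With a cap of 1, tasks effectively run one at a time regardless of pool
size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of capping outstanding tasks; not Cassandra's code.
final class BoundedReplaySketch {
    // Runs taskCount tasks and returns the peak number observed running at once.
    static int replay(int taskCount, int maxOutstanding, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> outstanding = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                outstanding.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    // Stand-in for applying one replayed mutation.
                    running.decrementAndGet();
                }));
                // Once the cap is reached, drain everything before submitting more.
                if (outstanding.size() >= maxOutstanding) {
                    waitAll(outstanding);
                }
            }
            waitAll(outstanding);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    // Blocks until every pending future completes, then clears the list.
    private static void waitAll(List<Future<?>> futures) {
        for (Future<?> future : futures) {
            try {
                future.get();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        futures.clear();
    }
}
```

Calling BoundedReplaySketch.replay(20, 1, 4) never has more than one task in flight, which is the
effect we were after: no contention in the CAS loop, at the cost of replaying serially.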


> ------------------------------------------------
>                 Key: CASSANDRA-7533
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jeremiah Jordan
>            Assignee: Yuki Morishita
>            Priority: Minor
>             Fix For: 2.0.10
> There are some workloads where commit log replay will run into contention issues with
> multiple things updating the same partition. Through some testing it was found that
> lowering MAX_OUTSTANDING_REPLAY_COUNT can help with this issue.
> The calculations added in CASSANDRA-6655 are one such place things get bottlenecked.
