cassandra-commits mailing list archives

From "Peter Schuller (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-3853) lower impact on old-gen promotion of slow nodes or connections
Date Sun, 05 Feb 2012 10:23:54 GMT
lower impact on old-gen promotion of slow nodes or connections
--------------------------------------------------------------

                 Key: CASSANDRA-3853
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3853
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Peter Schuller
            Assignee: Peter Schuller


Cassandra has the unfortunate behavior that when things are "slow" (nodes overloaded, etc.),
there is a tendency toward cascading failure if the system as a whole is under high load. This
is generally true of most systems, but one way in which we are worse than we need to be is the
way we queue things up between stages and for outgoing requests.

First off, I use the following premises:

* The node is not running Azul ;)
* The total cost of ownership (in terms of allocation+collection) of an object that dies in
old-gen is *much* higher than that of an object that dies in young gen.
* When CMS fails (concurrent mode failure or promotion failure), the resulting full GC is
*serial* and does not use all cores, and is a stop-the-world pause.

Here is how this very effectively leads to cascading failure of the "fallen and can't get
up" kind:

* Some node has a problem and is slow, even if just for a little while.
* Other nodes, especially neighbors in the replica set, start queueing up outgoing requests
to the node for {{rpc_timeout}} milliseconds.
* You have a high (let's say write) throughput of 50 thousand or so requests per second per
node.
* Because you want writes to be highly available and you are okay with high latency, you have
an {{rpc_timeout}} of 60 seconds.
* The total amount of memory used to retain 60 * 50,000 = 3 million requests is freaking high
(see the back-of-the-envelope sketch after this list).
* Young-gen GC pauses happen *much* more frequently than every 60 seconds, so those queued
requests survive many young collections.
* The result is that when a node goes down, other nodes in the replica set start *massively*
increasing their promotion rate into old gen. A cluster whose nodes are normally completely
fine, with a slow, steady trickle of promotion into old gen, now behaves very differently from
normal: the total allocation rate doesn't change much (perhaps a little, if clients are doing
re-tries), but the promotion rate into old gen increases massively.
* This increases the total cost of ownership, and thus demand for CPU resources.
* You will *very* easily see CMS's sweep phase unable to keep up with the incoming request
rate, even with a hugely inflated heap (the CMS sweep is not parallel, even though marking is).
* This leads to promotion failure/concurrent mode failure, and you fall into a full GC.
* But now, your full GC is effectively stealing CPU resources, since the serial collection
forces all cores but one on the system to sit completely idle.
* Once you come out of GC, you have a huge backlog of work to do, and you get bombarded by
other nodes that thought it was a good idea to retain 30 seconds' worth of messages in *their*
heaps. So you're instantly shot down again by your neighbors, falling into the next full GC
cycle even more easily than before.
* Meanwhile, the fact that you are in full GC is causing your neighbors to enter the same
predicament.
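
For a sense of scale, here is a rough back-of-the-envelope calculation using the numbers above;
the per-request size is an assumed figure for illustration only, not a measurement:

{code:java}
// Rough sketch of the backlog retained against a single slow peer, using the
// numbers from the scenario above. The ~1 KB per queued request is an assumption.
public class BacklogEstimate
{
    public static void main(String[] args)
    {
        long rpcTimeoutSeconds = 60;     // rpc_timeout from the scenario
        long requestsPerSecond = 50000;  // per-node write throughput from the scenario
        long bytesPerRequest = 1024;     // assumed average retained size per queued request

        long retainedRequests = rpcTimeoutSeconds * requestsPerSecond;  // 3,000,000
        long retainedBytes = retainedRequests * bytesPerRequest;        // roughly 3 GB

        System.out.printf("retained: %,d requests, roughly %,d MB%n",
                          retainedRequests, retainedBytes / (1024 * 1024));
    }
}
{code}

Nearly all of that data lives long enough to be promoted, which is exactly the behavior the
premises above say is expensive.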

The "solution" to this in production is to rapidly restart all nodes in the replica set. Doing
a live-change of RPC timeouts to something very very low might also do the trick.

This is a specific instance of the overall problem that we should IMO not be queueing up huge
amounts of data in memory. Just recently I saw a node with *10 million* requests pending.

We need to:

* Have support for more aggressively dropping requests instead of queueing them when sending
to other nodes.
* More aggressively drop requests internally; there is very little use in queueing up hundreds
of thousands of requests pending for {{MutationStage}} or {{ReadStage}}, etc., and especially
not for {{ReadStage}}, where any response is irrelevant once the timeout has been reached (a
rough sketch of the idea follows below).
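
A minimal sketch of the internal-drop idea, assuming a hypothetical {{ExpiringQueue}} wrapper
rather than any existing Cassandra class: each message remembers when it was enqueued, and the
consuming stage silently discards anything that has already outlived the timeout instead of
doing work whose response the coordinator will ignore anyway.

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not an existing Cassandra class.
public class ExpiringQueue<T>
{
    private static final class Entry<T>
    {
        final T payload;
        final long enqueuedAtNanos = System.nanoTime();

        Entry(T payload)
        {
            this.payload = payload;
        }
    }

    private final LinkedBlockingQueue<Entry<T>> queue = new LinkedBlockingQueue<Entry<T>>();
    private final long timeoutNanos;

    public ExpiringQueue(long timeoutMillis)
    {
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    public void offer(T payload)
    {
        queue.offer(new Entry<T>(payload));
    }

    /** Returns the next still-useful item, dropping anything that has already expired. */
    public T poll() throws InterruptedException
    {
        while (true)
        {
            Entry<T> entry = queue.take();
            if (System.nanoTime() - entry.enqueuedAtNanos <= timeoutNanos)
                return entry.payload;
            // Expired: the coordinator has already timed out, so doing the work (or
            // keeping the message alive in our heap) only adds GC and CPU pressure.
        }
    }
}
{code}

The same check applies on the outgoing side: before serializing a message to a peer, check
whether it has already exceeded its timeout and drop it on the floor if so.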

A complication here is that we *cannot* just drop requests so quickly that we never promote
into old gen. If we were to drop outgoing requests that quickly, we would be dropping requests
every time another node goes into a young GC. And if we retain requests long enough to ride out
another node's young GC, it also means we retain them long enough for them to be promoted into
old gen on our side (not strictly true given survivor spaces, but we can't expect to target
that distinction with any accuracy).

A possible alternative is to ask users to be better about using short timeouts, but that probably
ups the priority on controlling timeouts on a per-request basis rather than as coarse-grained
server-side settings. Even with shorter timeouts, though, we still need to be careful to drop
requests in the places where it makes sense, so that we avoid accumulating more than a timeout's
worth of data.
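
One possible shape for per-request timeouts, sketched with hypothetical names (none of this
exists in Cassandra today): each request carries its own deadline, and both the coordinator and
the replica consult that deadline instead of a single server-wide {{rpc_timeout}} when deciding
whether to keep queueing or to drop.

{code:java}
// Hypothetical sketch; the class and field names are illustrative only.
public class TimedRequest
{
    private final long createdAtMillis = System.currentTimeMillis();
    private final long timeoutMillis;   // supplied by the client, per request

    public TimedRequest(long timeoutMillis)
    {
        this.timeoutMillis = timeoutMillis;
    }

    /** True once the client has (or soon will have) given up on the response. */
    public boolean isExpired()
    {
        return System.currentTimeMillis() - createdAtMillis > timeoutMillis;
    }
}
{code}

With something like this in place, a client that can tolerate 60-second latencies pays for that
choice only on its own requests, instead of forcing every node in the replica set to retain a
minute's worth of everyone's traffic.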


        
