cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes
Date Wed, 17 Sep 2014 06:03:34 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136794#comment-14136794 ]

Benedict edited comment on CASSANDRA-7937 at 9/17/14 6:03 AM:
--------------------------------------------------------------

Under these workload conditions, CASSANDRA-3852 is unlikely to help; it is aimed more at reads.
CASSANDRA-6812 will help for writes, but it in no way affects the concept of load shedding or
backpressure; it only addresses write latency spikes and hence timeouts.

There are some issues with load shedding by itself, though, at least as it stands in Cassandra.
Take your example cluster, A..E, with RF=3, and let's say B _and_ C are now load shedding. The
result is that 1/5th of requests queue up and block in A, D and E, which within a short
window results in all requests to those servers being blocked, as all rpc threads end up waiting
for results that will never come. Now, with explicit back pressure (whether that means blocking
consumption of network messages or signalling the coordinator that we are load shedding) we
could detect this problem, start dropping those requests on the floor in A, D and E, and
continue to serve other requests.
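
Purely as an illustration, the kind of coordinator-side check I mean (ReplicaLoadTracker, the
shedding-signal set and OverloadedException are made-up names for this sketch, not existing
Cassandra classes):

{code:java}
import java.util.List;
import java.util.Set;

// Hypothetical coordinator-side check: if too few replicas for this key can accept
// writes, fail the request immediately instead of tying up an rpc thread waiting for
// responses that will never arrive.
final class ReplicaLoadTracker
{
    private final Set<String> sheddingReplicas; // populated from explicit backpressure signals

    ReplicaLoadTracker(Set<String> sheddingReplicas)
    {
        this.sheddingReplicas = sheddingReplicas;
    }

    void assertEnoughHealthyReplicas(List<String> replicasForKey, int required)
    {
        int healthy = 0;
        for (String replica : replicasForKey)
            if (!sheddingReplicas.contains(replica))
                healthy++;

        if (healthy < required)
            throw new OverloadedException("Dropping request: only " + healthy + " of "
                                          + required + " required replicas can accept writes");
    }

    static final class OverloadedException extends RuntimeException
    {
        OverloadedException(String message)
        {
            super(message);
        }
    }
}
{code}

The point is just to fail fast once we know enough replicas cannot respond, instead of parking
an rpc thread until the write timeout fires.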

FTR, we do already have an accidental back pressure system built into Incoming/OutboundTcpConnection,
with only negative consequences. If the remote server is blocked due to GC (or some other
event), it will not consume its IncomingTcpConnection, which will cause our coordinator's queue
to that node to back up. This is something we should deal with, and introducing explicit back
pressure of the same kind when a node cannot cope with its load would allow us to deal with
both situations using the same mechanism. (We do sort of try to deal with this, but only as
more messages arrive, which is unlikely to happen when our client rpc threads are all tied up
waiting for these responses.)
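
To make the distinction concrete, a minimal sketch of an explicit, bounded outbound queue; the
names are illustrative and do not correspond to the real OutboundTcpConnection internals:

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Bounded per-peer outbound queue that surfaces backpressure explicitly instead of
// letting messages pile up silently while the remote node is stalled (e.g. in GC).
final class BoundedOutboundQueue<M>
{
    private final BlockingQueue<M> queue;
    private final Runnable onBackpressure; // e.g. mark the peer overloaded / start shedding locally

    BoundedOutboundQueue(int capacity, Runnable onBackpressure)
    {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.onBackpressure = onBackpressure;
    }

    // Returns false (and signals backpressure) rather than queueing without bound.
    boolean offer(M message, long timeout, TimeUnit unit) throws InterruptedException
    {
        if (queue.offer(message, timeout, unit))
            return true;
        onBackpressure.run();   // the peer is not draining its connection: react now,
        return false;           // rather than waiting for more messages to arrive
    }

    M take() throws InterruptedException
    {
        return queue.take();
    }
}
{code}

Whether the reaction is to shed, to signal the coordinator, or to drop the connection is exactly
the policy question here; the sketch only shows where the signal would originate.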


> Apply backpressure gently when overloaded with writes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-7937
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7937
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.0
>            Reporter: Piotr Kołaczkowski
>              Labels: performance
>
> When writing huge amounts of data into a C* cluster from analytic tools like Hadoop or
Apache Spark, we can see that C* often can't keep up with the load. This is because analytic
tools typically write data "as fast as they can", in parallel from many nodes, and they are
not artificially rate-limited, so C* is the bottleneck here. Also, increasing the number of
nodes doesn't really help, because in a collocated setup this also increases the number of
Hadoop/Spark nodes (writers), so although the achievable write performance is higher, the problem still remains.
> We observe the following behavior:
> 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up
> 2. the available memory limit for memtables is reached and writes are no longer accepted
> 3. the application gets hit by "write timeout" and retries repeatedly, in vain
> 4. after several failed attempts to write, the job gets aborted
> Desired behaviour:
> 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up
> 2. after exceeding some memtable "fill threshold", C* applies adaptive rate limiting
to writes - the more the buffers fill up, the fewer writes/s are accepted; however, writes
still complete within the write timeout
> 3. thanks to the slowed-down data ingestion, flush can now finish before all the memory
gets used
> Of course, the details of how rate limiting could be done are up for discussion.
> It may also be worth considering putting such logic into the driver rather than C* core, but
then C* needs to expose at least the following information to the driver, so we could calculate
the desired maximum data rate:
> 1. current amount of memory available for writes before they would completely block
> 2. total amount of data queued to be flushed and flush progress (amount of data remaining
to flush for the memtable currently being flushed)
> 3. average flush write speed
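
For what it's worth, a back-of-the-envelope sketch of the kind of adaptive limit the quoted
description asks for, usable either in C* core or in a driver. The class name, the fill
threshold and the linear formula are assumptions for illustration only; the inputs are the
metrics listed above:

{code:java}
// Illustrative only: shrink the permitted ingest rate as memtables fill up, so that
// ingestion cannot outrun flushing for long. Inputs correspond to metrics 1 and 3 above;
// metric 2 (data queued for flush) could be used to refine the estimate further.
final class AdaptiveWriteRateLimiter
{
    private final double fillThreshold; // e.g. 0.5: start limiting once memtables are half full (< 1.0)

    AdaptiveWriteRateLimiter(double fillThreshold)
    {
        this.fillThreshold = fillThreshold;
    }

    // Returns the permitted ingest rate in bytes/s given current memtable occupancy.
    double permittedBytesPerSecond(long memoryAvailableBytes, long memoryTotalBytes,
                                   double avgFlushBytesPerSec)
    {
        double fill = 1.0 - (double) memoryAvailableBytes / memoryTotalBytes;
        if (fill < fillThreshold)
            return Double.POSITIVE_INFINITY; // plenty of headroom: no limiting
        // Scale from effectively unlimited at the threshold down to the sustained flush
        // throughput when the memtable budget is exhausted.
        double pressure = Math.min(1.0, (fill - fillThreshold) / (1.0 - fillThreshold));
        return avgFlushBytesPerSec / Math.max(pressure, 0.01);
    }
}
{code}

Whether the throttling then happens server side (delaying the mutation response) or client side
(the driver pacing its requests) is orthogonal to the calculation itself.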



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
