cassandra-commits mailing list archives

From Piotr Kołaczkowski (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes
Date Mon, 29 Dec 2014 09:55:14 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259967#comment-14259967 ]

Piotr Kołaczkowski edited comment on CASSANDRA-7937 at 12/29/14 9:54 AM:
-------------------------------------------------------------------------

I like the idea of using parallelism level as a backpressure mechanism. That would have a
nice positive effect of automatically reducing the amount of memory used for queuing the requests.


However, I have a few concerns:
# my biggest concern is that even limiting a single client to one write at a time (window size = 1) might still be too fast for some fast clients if the rows are big enough, particularly when writing big cells of data, where big = hundreds of kB / single MBs per cell. Cassandra is extremely efficient at ingesting data into memtables; if ingestion is faster than we are able to flush, then we still have a problem (see the rough arithmetic sketch after this list).
# when not using a TokenAware LBP, we can have a problem that, in case of an unbalanced write load, the fast (not loaded) nodes would be needlessly penalized with a lower parallelism level, because the window size would be per coordinator, not per partition (replica set). Am I right here?
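
To make the first concern concrete, here is a back-of-envelope sketch with assumed, illustrative numbers (the cell size, round-trip time and flush throughput are my assumptions, not measurements): even a single client limited to one in-flight write can rival flush throughput when the cells are large.

{code:java}
// Back-of-envelope sketch with assumed, illustrative numbers (not measurements):
// a single client limited to one in-flight write can still outpace flush
// when cells are large.
public class WindowOfOneEstimate {
    public static void main(String[] args) {
        double cellBytes = 1_000_000;           // assumed ~1 MB per cell
        double roundTripSeconds = 0.010;        // assumed 10 ms per write round trip
        double perClientBytesPerSec = cellBytes / roundTripSeconds;  // ~100 MB/s per client
        double assumedFlushBytesPerSec = 100_000_000;                // assumed ~100 MB/s flush throughput

        System.out.printf("single client with window=1: %.0f MB/s, flush: %.0f MB/s%n",
                perClientBytesPerSec / 1e6, assumedFlushBytesPerSec / 1e6);
        // A handful of such clients exceeds flush throughput no matter how
        // small the per-client window is.
    }
}
{code}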

I wonder whether using a penalty delay *after* processing the write request would be a better idea: in case of an unbalanced load, when one partition gets hammered but most others are fine, it would slow down writes for that one partition (replica set) but would not affect the latency of writes to other partitions. I'm for applying the delay after the write, because by then we already know the replica set and its load, and we don't need to keep the data queued in memory, for it has already been written.

The (simplified) process would look as follows:
# the write request gets accepted by the coordinator
# the write gets sent to the proper replica nodes
# the replicas that acknowledge the write also reply with their current load information
# the received load information gets averaged (median)
# when the CL is satisfied but the load was high enough to be in the "dangerous zone", the coordinator adds some artificial delay before acknowledging the write to the client - of course small enough not to exceed the write timeout. Or, if that is not possible / wise for some other reason (e.g. holding memory), it just tells the driver that "load was high, you'd better slow down" and the driver waits before processing the next request. A rough coordinator-side sketch follows below.
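
A minimal sketch of steps 3-5, assuming made-up names and thresholds (DANGER_THRESHOLD, MAX_DELAY_MS and penaltyDelayMs are illustrative, not existing Cassandra code); it only shows how the median of the reported loads could be turned into a delay that stays well below the write timeout:

{code:java}
import java.util.Arrays;
import java.util.List;

// Hypothetical coordinator-side sketch: combine per-replica load reports into a
// median and derive an artificial delay before acking the client.
public class PostWriteDelaySketch {
    static final double DANGER_THRESHOLD = 0.75;   // assumed start of the "dangerous zone"
    static final long WRITE_TIMEOUT_MS = 2000;     // the delay must never exceed the write timeout
    static final long MAX_DELAY_MS = 200;          // cap well below the timeout

    /** Median of the load values (0.0 - 1.0) reported by the acking replicas. */
    static double medianLoad(List<Double> loads) {
        double[] sorted = loads.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Delay applied after the CL is satisfied, scaled by how far into the danger zone we are. */
    static long penaltyDelayMs(double load) {
        if (load < DANGER_THRESHOLD) return 0;
        double overload = (load - DANGER_THRESHOLD) / (1.0 - DANGER_THRESHOLD);
        return Math.min((long) (overload * MAX_DELAY_MS), WRITE_TIMEOUT_MS / 2);
    }

    public static void main(String[] args) throws InterruptedException {
        List<Double> replicaLoads = Arrays.asList(0.60, 0.85, 0.90);  // reported with the write acks
        long delay = penaltyDelayMs(medianLoad(replicaLoads));        // 80 ms for these numbers
        Thread.sleep(delay);  // ...or report the load back so the driver waits instead
        System.out.println("delayed ack by " + delay + " ms");
    }
}
{code}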

WDYT?



> Apply backpressure gently when overloaded with writes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-7937
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7937
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.0
>            Reporter: Piotr Kołaczkowski
>              Labels: performance
>
> When writing huge amounts of data into a C* cluster from analytic tools like Hadoop or Apache Spark, we can see that often C* can't keep up with the load. This is because analytic tools typically write data "as fast as they can", in parallel, from many nodes, and they are not artificially rate-limited, so C* is the bottleneck here. Also, increasing the number of nodes doesn't really help, because in a collocated setup this also increases the number of Hadoop/Spark nodes (writers), and although the achievable write performance is higher, the problem still remains.
> We observe the following behavior:
> 1. data is ingested at an extreme fast pace into memtables and flush queue fills up
> 2. the available memory limit for memtables is reached and writes are no longer accepted
> 3. the application gets hit by "write timeout", and retries repeatedly, in vain 
> 4. after several failed attempts to write, the job gets aborted 
> Desired behaviour:
> 1. data is ingested at an extreme fast pace into memtables and flush queue fills up
> 2. after exceeding some memtable "fill threshold", C* applies adaptive rate limiting to writes - the more the buffers are filled up, the fewer writes/s are accepted; however, writes still complete within the write timeout
> 3. thanks to the slowed-down data ingestion, the flush can now finish before all the memory gets used
> Of course, the details of how rate limiting could be done are up for discussion.
> It may also be worth considering putting such logic into the driver, not the C* core, but then C* needs to expose at least the following information to the driver, so we could calculate the desired maximum data rate (a rough sketch of such a calculation follows after the list):
> 1. current amount of memory available for writes before they would completely block
> 2. total amount of data queued to be flushed and flush progress (amount of data to flush
remaining for the memtable currently being flushed)
> 3. average flush write speed
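
A hedged, driver-side sketch of how the three values listed above could be combined into a target write rate; the variable names and the combination rule are assumptions for illustration, since C* does not currently expose these metrics:

{code:java}
// Hypothetical driver-side rate estimate from the three metrics the issue asks
// Cassandra to expose; all names and numbers here are illustrative.
public class DriverRateEstimate {
    public static void main(String[] args) {
        double availableWriteBytes = 256e6;  // (1) memory left for writes before they block
        double queuedFlushBytes = 512e6;     // (2) data already queued to be flushed
        double flushBytesPerSec = 90e6;      // (3) average flush write speed

        // Time until the pending flush work drains at the observed flush speed.
        double drainSeconds = queuedFlushBytes / flushBytesPerSec;

        // Writing faster than availableWriteBytes / drainSeconds would exhaust the
        // remaining memtable space before the queue drains, so cap the driver's
        // send rate there, and never exceed the flush speed for long.
        double maxBytesPerSec = Math.min(availableWriteBytes / drainSeconds, flushBytesPerSec);

        System.out.printf("target write rate: %.0f MB/s%n", maxBytesPerSec / 1e6);
    }
}
{code}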



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
