From: "Ariel Weisberg (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Date: Mon, 13 Jun 2016 21:52:04 +0000 (UTC)
Subject: [jira] [Comment Edited] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes

    [ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328373#comment-15328373 ]

Ariel Weisberg edited comment on CASSANDRA-7937 at 6/13/16 9:51 PM:
--------------------------------------------------------------------

I think we can make this situation better, and I mentioned some ideas at NGCC and in CASSANDRA-11327. There are two issues.

The first is that if flushing falls behind, throughput falls to zero instead of degrading to the rate at which flushing progresses, which is usually not zero. Right now it looks like zero because flushing doesn't release any memory as it progresses; it is all or nothing.

Aleksey mentioned we could do something like early opening for flushing so that memory is made available sooner. Alternatively we could overcommit and then gradually release memory as flushing progresses.

The second issue, and this isn't really related to backpressure, is that flushing falls behind in several reasonable configurations. Ingest has gotten faster and I don't think flushing has kept pace, so it's easier for flushing to fall behind when it's driven by a single thread against a busy device (even a fast SSD). I haven't tested this yet, but I suspect that if you use multiple JBOD paths for a fast device like an SSD and increase memtable_flush_writers you will get enough additional flushing throughput to keep up with ingest. Right now flushing is single threaded for a single path and only one flush can occur at any time.
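Going back to the first issue, here is roughly the shape the overcommit/gradual-release idea could take. This is a sketch only; the MemoryPool interface and the per-partition callback are invented for illustration and are not the real allocator or Memtable API:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: return memtable memory to the pool in increments as partitions
// are written to the sstable, instead of releasing everything when the flush
// completes. MemoryPool is a stand-in, not Cassandra's real allocator API.
final class IncrementalFlushRelease
{
    interface MemoryPool
    {
        void release(long bytes); // unblocks writers waiting on the memtable limit
    }

    private final MemoryPool pool;
    private final AtomicLong owned; // bytes still held by the flushing memtable

    IncrementalFlushRelease(MemoryPool pool, long memtableSizeBytes)
    {
        this.pool = pool;
        this.owned = new AtomicLong(memtableSizeBytes);
    }

    // Called by the flush writer after each partition (or batch) is written out.
    void onBytesFlushed(long bytesWritten)
    {
        // Never release more than the memtable actually owned.
        long before = owned.getAndUpdate(v -> Math.max(0, v - bytesWritten));
        long release = Math.min(before, bytesWritten);
        if (release > 0)
            pool.release(release);
    }
}
{code}

With something like this, writers blocked on the memtable limit would see memory trickle back at roughly the flush rate instead of waiting for the whole memtable to finish.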
Flushing falling behind is more noticeable when you let compaction have more threads and a bigger rate limit, because compaction can dirty enough memory in the filesystem cache that writing it back causes a temporally localized slowdown in flushing. That slowdown is enough to cause timeouts when there is no more free memory because flushing didn't finish soon enough.

I think the long term solution is that the further flushing falls behind, the more concurrent flush threads we start deploying, kind of like compaction, up to the configured limit. [Right now there is a single thread scheduling flushes and waiting on the result.|https://github.com/apache/cassandra/blob/cassandra-3.7/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1130] memtable_flush_writers doesn't help, because the code only [generates more flush runnables for a memtable if there are multiple data directories|https://github.com/apache/cassandra/blob/cassandra-3.7/src/java/org/apache/cassandra/db/Memtable.java#L278]. C* is already divvying up the heap using memtable_cleanup_threshold, which would allow for concurrent flushing; it's just not actually flushing concurrently.
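As a straw man for that ramp-up, something along these lines (all names are invented, and this glosses over the fact that today flush runnables are only generated per data directory):

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch only: the deeper the flush backlog, the more of the configured flush
// writers we actually use, similar in spirit to how compaction ramps up.
final class AdaptiveFlushExecutor
{
    private final LinkedBlockingQueue<Runnable> backlog = new LinkedBlockingQueue<>();
    private final ThreadPoolExecutor pool;
    private final int maxFlushWriters; // e.g. memtable_flush_writers from cassandra.yaml

    AdaptiveFlushExecutor(int maxFlushWriters)
    {
        this.maxFlushWriters = maxFlushWriters;
        // Start with one writer and let idle extras die off.
        this.pool = new ThreadPoolExecutor(1, maxFlushWriters, 60, TimeUnit.SECONDS, backlog);
        this.pool.allowCoreThreadTimeOut(true);
    }

    void submitFlush(Runnable flushTask)
    {
        pool.execute(flushTask);
        // One extra writer per queued flush, capped at the configured limit.
        int desired = Math.min(maxFlushWriters, 1 + backlog.size());
        pool.setCorePoolSize(desired); // raising the core size starts threads for queued work
    }
}
{code}

With an empty backlog this behaves like today's single flush thread; it only spends the extra writers when flushing is actually behind.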
> Apply backpressure gently when overloaded with writes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-7937
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7937
>             Project: Cassandra
>          Issue Type: Improvement
>         Environment: Cassandra 2.0
>            Reporter: Piotr Kołaczkowski
>              Labels: performance
>
> When writing huge amounts of data into a C* cluster from analytic tools like Hadoop or Apache Spark, we can see that C* often can't keep up with the load. This is because analytic tools typically write data "as fast as they can", in parallel, from many nodes, and they are not artificially rate-limited, so C* is the bottleneck here. Also, increasing the number of nodes doesn't really help, because in a collocated setup this also increases the number of Hadoop/Spark nodes (writers), and although the possible write performance is higher, the problem still remains.
> We observe the following behavior:
> 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up
> 2. the available memory limit for memtables is reached and writes are no longer accepted
> 3. the application gets hit by "write timeout" and retries repeatedly, in vain
> 4. after several failed attempts to write, the job gets aborted
> Desired behaviour:
> 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up
> 2. after exceeding some memtable "fill threshold", C* applies adaptive rate limiting to writes - the more the buffers fill up, the fewer writes/s are accepted; however, writes still complete within the write timeout
> 3. thanks to the slowed-down data ingestion, flushing can now finish before all the memory gets used
> Of course the details of how rate limiting could be done are up for discussion.
> It may also be worth considering putting such logic into the driver, not the C* core, but then C* needs to expose at least the following information to the driver, so we can calculate the desired maximum data rate:
> 1. the current amount of memory available for writes before they would completely block
> 2. the total amount of data queued to be flushed, plus flush progress (the amount of data remaining to flush for the memtable currently being flushed)
> 3. the average flush write speed
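For illustration, a minimal sketch of how a driver could combine those three exposed numbers into a target ingest rate. Every name here and the formula itself are assumptions, not something specified in the ticket:

{code:java}
// Sketch only: estimate a safe ingest rate from the three metrics the
// description asks the server to expose. The formula is a rough heuristic,
// not an agreed-upon design.
final class WriteRateEstimator
{
    /**
     * @param freeMemtableBytes   memory still available before writes would block
     * @param queuedFlushBytes    data queued to be flushed (including the remainder
     *                            of the memtable currently being flushed)
     * @param flushBytesPerSecond observed average flush write speed
     * @param horizonSeconds      planning horizon, e.g. the write timeout
     * @return suggested maximum ingest rate in bytes per second
     */
    static double maxIngestBytesPerSecond(long freeMemtableBytes,
                                          long queuedFlushBytes,
                                          double flushBytesPerSecond,
                                          double horizonSeconds)
    {
        // Flushing can free at most the currently queued bytes over the horizon.
        double drainable = Math.min(queuedFlushBytes, flushBytesPerSecond * horizonSeconds);
        // Accept no more than what is free now plus what flushing will free in time.
        return (Math.max(0, freeMemtableBytes) + drainable) / horizonSeconds;
    }
}
{code}

The underlying idea is simply that over a planning horizon such as the write timeout, the node can absorb what is currently free plus whatever flushing can realistically retire in that window.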