From: "Ariel Weisberg (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Date: Mon, 13 Jun 2016 21:52:04 +0000 (UTC)
Subject: [jira] [Comment Edited] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes

    [ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328373#comment-15328373 ]

Ariel Weisberg edited comment on CASSANDRA-7937 at 6/13/16 9:51 PM:
--------------------------------------------------------------------

I think we can make this situation better, and I mentioned some ideas at NGCC and in CASSANDRA-11327. There are two issues.

The first is that if flushing falls behind, throughput falls to zero instead of degrading to the rate at which flushing progresses, which is usually not zero. Right now it looks like zero because flushing doesn't release any memory as it progresses; it is all or nothing.

Aleksey mentioned we could do something like early opening for flushing so that memory is made available sooner. Alternatively we could overcommit and then gradually release memory as flushing progresses.

The second issue, and this isn't really related to backpressure, is that flushing falls behind in several reasonable configurations. Ingest has gotten faster and I don't think flushing has kept pace, so it's easier for flushing to fall behind when it's driven by a single thread against a busy device (even a fast SSD). I haven't tested this yet, but I suspect that if you use multiple JBOD paths for a fast device like an SSD and increase memtable_flush_writers you will get enough additional flushing throughput to keep up with ingest. Right now flushing is single threaded for a single path and only one flush can occur at any time.
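Going back to the first issue, here is roughly the shape the overcommit/gradual-release idea could take. This is a sketch only; the MemoryPool interface and the per-partition callback are invented for illustration and are not the real allocator or Memtable API:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: return memtable memory to the pool in increments as partitions
// are written to the sstable, instead of releasing everything when the flush
// completes. MemoryPool is a stand-in, not Cassandra's real allocator API.
final class IncrementalFlushRelease
{
    interface MemoryPool
    {
        void release(long bytes); // unblocks writers waiting on the memtable limit
    }

    private final MemoryPool pool;
    private final AtomicLong owned; // bytes still held by the flushing memtable

    IncrementalFlushRelease(MemoryPool pool, long memtableSizeBytes)
    {
        this.pool = pool;
        this.owned = new AtomicLong(memtableSizeBytes);
    }

    // Called by the flush writer after each partition (or batch) is written out.
    void onBytesFlushed(long bytesWritten)
    {
        // Never release more than the memtable actually owned.
        long before = owned.getAndUpdate(v -> Math.max(0, v - bytesWritten));
        long release = Math.min(before, bytesWritten);
        if (release > 0)
            pool.release(release);
    }
}
{code}

With something like this, writers blocked on the memtable limit would see memory trickle back at roughly the flush rate instead of waiting for the whole memtable to finish.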
Flushing falling behind is more noticeable when you let compaction have more threads and a bigger rate limit, because compaction can dirty enough memory in the filesystem cache that writing it back causes a temporally localized slowdown in flushing. That slowdown is enough to cause timeouts when there is no more free memory because flushing didn't finish soon enough.

I think the long term solution is that the further flushing falls behind, the more concurrent flush threads we start deploying, kind of like compaction, up to the configured limit. [Right now there is a single thread scheduling flushes and waiting on the result.|https://github.com/apache/cassandra/blob/cassandra-3.7/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1130] memtable_flush_writers doesn't help, because the code only [generates more flush runnables for a memtable if there are multiple data directories|https://github.com/apache/cassandra/blob/cassandra-3.7/src/java/org/apache/cassandra/db/Memtable.java#L278]. C* is already divvying up the heap using memtable_cleanup_threshold, which would allow for concurrent flushing; it's just not actually flushing concurrently.
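As a straw man for that ramp-up, something along these lines (all names are invented, and this glosses over the fact that today flush runnables are only generated per data directory):

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch only: the deeper the flush backlog, the more of the configured flush
// writers we actually use, similar in spirit to how compaction ramps up.
final class AdaptiveFlushExecutor
{
    private final LinkedBlockingQueue<Runnable> backlog = new LinkedBlockingQueue<>();
    private final ThreadPoolExecutor pool;
    private final int maxFlushWriters; // e.g. memtable_flush_writers from cassandra.yaml

    AdaptiveFlushExecutor(int maxFlushWriters)
    {
        this.maxFlushWriters = maxFlushWriters;
        // Start with one writer and let idle extras die off.
        this.pool = new ThreadPoolExecutor(1, maxFlushWriters, 60, TimeUnit.SECONDS, backlog);
        this.pool.allowCoreThreadTimeOut(true);
    }

    void submitFlush(Runnable flushTask)
    {
        pool.execute(flushTask);
        // One extra writer per queued flush, capped at the configured limit.
        int desired = Math.min(maxFlushWriters, 1 + backlog.size());
        pool.setCorePoolSize(desired); // raising the core size starts threads for queued work
    }
}
{code}

With an empty backlog this behaves like today's single flush thread; it only spends the extra writers when flushing is actually behind.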
> Apply backpressure gently when overloaded with writes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-7937
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7937
>             Project: Cassandra
>          Issue Type: Improvement
>         Environment: Cassandra 2.0
>            Reporter: Piotr Kołaczkowski
>              Labels: performance
>
> When writing huge amounts of data into a C* cluster from analytic tools like Hadoop or Apache Spark, we can see that C* often can't keep up with the load. This is because analytic tools typically write data "as fast as they can", in parallel, from many nodes, and they are not artificially rate-limited, so C* is the bottleneck here. Also, increasing the number of nodes doesn't really help, because in a collocated setup this also increases the number of Hadoop/Spark nodes (writers), and although the possible write performance is higher, the problem still remains.
> We observe the following behavior:
> 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up
> 2. the available memory limit for memtables is reached and writes are no longer accepted
> 3. the application gets hit by "write timeout" and retries repeatedly, in vain
> 4. after several failed attempts to write, the job gets aborted
> Desired behaviour:
> 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up
> 2. after exceeding some memtable "fill threshold", C* applies adaptive rate limiting to writes - the more the buffers fill up, the fewer writes/s are accepted; however, writes still complete within the write timeout
> 3. thanks to the slowed-down data ingestion, flushing can now finish before all the memory gets used
> Of course the details of how rate limiting could be done are up for discussion.
> It may also be worth considering putting such logic into the driver, not the C* core, but then C* needs to expose at least the following information to the driver, so we can calculate the desired maximum data rate:
> 1. the current amount of memory available for writes before they would completely block
> 2. the total amount of data queued to be flushed, plus flush progress (the amount of data remaining to flush for the memtable currently being flushed)
> 3. the average flush write speed
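For illustration, a minimal sketch of how a driver could combine those three exposed numbers into a target ingest rate. Every name here and the formula itself are assumptions, not something specified in the ticket:

{code:java}
// Sketch only: estimate a safe ingest rate from the three metrics the
// description asks the server to expose. The formula is a rough heuristic,
// not an agreed-upon design.
final class WriteRateEstimator
{
    /**
     * @param freeMemtableBytes   memory still available before writes would block
     * @param queuedFlushBytes    data queued to be flushed (including the remainder
     *                            of the memtable currently being flushed)
     * @param flushBytesPerSecond observed average flush write speed
     * @param horizonSeconds      planning horizon, e.g. the write timeout
     * @return suggested maximum ingest rate in bytes per second
     */
    static double maxIngestBytesPerSecond(long freeMemtableBytes,
                                          long queuedFlushBytes,
                                          double flushBytesPerSecond,
                                          double horizonSeconds)
    {
        // Flushing can free at most the currently queued bytes over the horizon.
        double drainable = Math.min(queuedFlushBytes, flushBytesPerSecond * horizonSeconds);
        // Accept no more than what is free now plus what flushing will free in time.
        return (Math.max(0, freeMemtableBytes) + drainable) / horizonSeconds;
    }
}
{code}

The underlying idea is simply that over a planning horizon such as the write timeout, the node can absorb what is currently free plus whatever flushing can realistically retire in that window.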