Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Sun, 13 Dec 2015 02:37:46 +0000 (UTC)
From: "Ariel Weisberg (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12827721.1430931031000.7923.1449974266864@Atlassian.JIRA>
In-Reply-To: <JIRA.12827721.1430931031000@Atlassian.JIRA>
References: <JIRA.12827721.1430931031000@Atlassian.JIRA>
 <JIRA.12827721.1430931031849@arcas>
Subject: [jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight
 requests at the coordinator
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054761#comment-15054761 ] 

Ariel Weisberg commented on CASSANDRA-9318:
-------------------------------------------

I got two cstar jobs to complete.

[This job is set to allow 16 megabytes of transactions per coordinator, and disables reads until they come back down to 12 megabytes.|http://cstar.datastax.com/graph?command=one_job&stats=d1e720c8-a125-11e5-9051-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=6664.35&ymin=0&ymax=11883.3]
[This job is set to allow 64 megabytes of transactions per coordinator, and disables reads until they come back down to 60 megabytes.|http://cstar.datastax.com/graph?command=one_job&stats=26853362-a127-11e5-80c2-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=322.85&ymin=0&ymax=12972.3]

The job with 64 megabytes in flight looks like it failed after 300 seconds. I didn't expect the threshold for things to fall apart to be quite that low, but generally speaking, yes, more data in flight tends to cause bad things to happen.

So why did the second one fall apart? First off, mad props to whoever started collecting the GC logs. Lots of continual full GC at the end. Sure enough, the heap is only 1 gigabyte. Are we seriously running all our performance tests with a default heap of 1 gigabyte?

I don't think it failed due to in-flight requests (it only had 32 megabytes in flight). I think it OOMed due to other heap pressure. For this in-flight request backpressure to work, I think we need to include the weight of memtables when making the decision. I am going to bump up the heap and try again to see if I can reduce the impact of other heap pressure to the point that we can start buffering more requests in flight.
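
For anyone following along, the high/low watermark behaviour the two jobs exercise can be sketched roughly as below. This is a minimal illustration only, not the patch under test: the class name InFlightLimiter and its methods are hypothetical, it tracks bytes but not request counts, and how the flag gets applied to client connections is left out.

```java
// Hypothetical sketch of high/low watermark backpressure.
// Names are illustrative and not taken from the Cassandra code base.
class InFlightLimiter {
    private final long highWatermarkBytes;
    private final long lowWatermarkBytes;
    private long inFlightBytes = 0;
    private boolean readsEnabled = true;

    InFlightLimiter(long highWatermarkBytes, long lowWatermarkBytes) {
        if (lowWatermarkBytes > highWatermarkBytes)
            throw new IllegalArgumentException("low watermark must not exceed high watermark");
        this.highWatermarkBytes = highWatermarkBytes;
        this.lowWatermarkBytes = lowWatermarkBytes;
    }

    // Called when a request of the given size is accepted by the coordinator.
    synchronized void onRequestStart(long bytes) {
        inFlightBytes += bytes;
        // Reaching the high watermark stops reading from client connections.
        if (inFlightBytes >= highWatermarkBytes)
            readsEnabled = false;
    }

    // Called when a request completes and its bytes are no longer in flight.
    synchronized void onRequestComplete(long bytes) {
        inFlightBytes -= bytes;
        // Reads resume only once usage has come back down to the low watermark,
        // so the limiter does not flap around a single threshold.
        if (inFlightBytes <= lowWatermarkBytes)
            readsEnabled = true;
    }

    synchronized boolean readsEnabled() {
        return readsEnabled;
    }

    synchronized long inFlightBytes() {
        return inFlightBytes;
    }
}
```

With the first job's settings this would be constructed as new InFlightLimiter(16L << 20, 12L << 20). The gap between the two watermarks is what keeps reads from being toggled on every request once the coordinator is near its limit.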

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x, 2.2.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might track the number of outstanding bytes and requests and, if either reaches a high watermark, disable reads on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)