Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Fri, 8 May 2015 19:40:00 +0000 (UTC)
From: "Jonathan Ellis (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12827721.1430931031000.55536.1431114000987@Atlassian.JIRA>
In-Reply-To: <JIRA.12827721.1430931031000@Atlassian.JIRA>
References: <JIRA.12827721.1430931031000@Atlassian.JIRA>
 <JIRA.12827721.1430931031849@arcas>
Subject: [jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight
 requests at the coordinator
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535356#comment-14535356 ] 

Jonathan Ellis commented on CASSANDRA-9318:
-------------------------------------------

> It sounds like Jonathan is suggesting we simply prune our ExpiringMap based on bytes tracked as well as time?

No, I'm suggesting we abort requests more aggressively with OverloadedException (OE) *before sending them to replicas*.  One place this might make sense is sendToHintedEndpoints, where we already throw OE.

Right now we only throw OE once we start writing hints for a node that is in trouble.  This doesn't seem to be aggressive enough.  (Although, most of our users are on 2.0 where we allowed 8x as many hints in flight before starting to throttle.)

So, I am suggesting we also track outstanding requests (perhaps with the ExpiringMap, as you suggest) and stop accepting requests once we hit a reasonable limit of "you can't possibly process more requests than this in parallel."
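
A minimal sketch of that idea, under stated assumptions: InFlightLimiter, its limits, and the acquire/release methods are hypothetical names, and the OverloadedException here is a local stand-in rather than Cassandra's own class. The coordinator would call acquire before sending a request to replicas and release once the request completes or times out.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for Cassandra's OverloadedException, declared locally
// so that the sketch compiles on its own.
class OverloadedException extends RuntimeException {
    OverloadedException(String message) {
        super(message);
    }
}

// Hypothetical coordinator-side bound on outstanding requests and bytes.
class InFlightLimiter {
    private final long maxRequests;
    private final long maxBytes;
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong bytes = new AtomicLong();

    InFlightLimiter(long maxRequests, long maxBytes) {
        this.maxRequests = maxRequests;
        this.maxBytes = maxBytes;
    }

    // Reserve capacity before the request is sent to any replica.
    // The counters are bumped optimistically and rolled back on rejection,
    // so under contention a request may occasionally be rejected slightly early.
    void acquire(long requestBytes) {
        long r = requests.incrementAndGet();
        long b = bytes.addAndGet(requestBytes);
        if (r > maxRequests || b > maxBytes) {
            requests.decrementAndGet();
            bytes.addAndGet(-requestBytes);
            throw new OverloadedException("too many requests in flight at the coordinator");
        }
    }

    // Return capacity when the request completes, fails, or times out.
    void release(long requestBytes) {
        requests.decrementAndGet();
        bytes.addAndGet(-requestBytes);
    }

    long inFlightRequests() {
        return requests.get();
    }

    long inFlightBytes() {
        return bytes.get();
    }
}
```

Rejecting at acquire time means the request is never buffered or sent, which is the point: requests that are already in flight cannot be cancelled, so the only cheap place to shed load is before dispatch.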

> The ExpiringMap requests are already "in-flight" and cannot be cancelled, so their effect on other nodes cannot be rescinded, and imposing a limit does not stop us issuing more requests to the nodes in the cluster that are failing to keep up and respond to us.

Right, and I'm fine with that.  The goal is not to keep the replica completely out of trouble.  The goal is to keep the coordinator from falling over from buffering ExpiringMap and MessagingService entries that it can't drain fast enough.  Secondarily, this will help the replica too: our existing load shedding is fine at recovering from temporary spikes in load, but it isn't good enough to save the replica when the coordinators keep throwing more at it while it's already overwhelmed.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might track the number of outstanding bytes and requests and, if either reaches a high watermark, disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.
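
The high/low watermark scheme in the description above could be sketched roughly as follows. WatermarkGate and its method names are hypothetical, and for brevity the sketch tracks only outstanding bytes, not the request count. The returned flag is what a server would use to decide whether to keep reading from client connections; in a Netty-based server that might map to toggling auto-read on the channel, which is an assumption and not something the ticket specifies.

```java
// Hypothetical gate with hysteresis: reads are disabled at the high watermark
// and re-enabled only once outstanding bytes fall to the low watermark.
class WatermarkGate {
    private final long lowWatermark;
    private final long highWatermark;
    private long outstandingBytes;
    private boolean readsEnabled = true;

    WatermarkGate(long lowWatermark, long highWatermark) {
        if (lowWatermark >= highWatermark) {
            throw new IllegalArgumentException("low watermark must be below high watermark");
        }
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
    }

    // Record an accepted request; returns whether client reads should stay enabled.
    synchronized boolean onRequestStart(long bytes) {
        outstandingBytes += bytes;
        if (readsEnabled && outstandingBytes >= highWatermark) {
            readsEnabled = false;
        }
        return readsEnabled;
    }

    // Record a completed request; returns whether client reads may be enabled.
    synchronized boolean onRequestComplete(long bytes) {
        outstandingBytes -= bytes;
        if (!readsEnabled && outstandingBytes <= lowWatermark) {
            readsEnabled = true;
        }
        return readsEnabled;
    }

    synchronized boolean readsEnabled() {
        return readsEnabled;
    }
}
```

The gap between the two watermarks keeps the gate from flapping when load hovers near a single threshold.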

