cassandra-commits mailing list archives

From "Jason Brown (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation
Date Thu, 30 Jun 2016 20:33:10 GMT


Jason Brown commented on CASSANDRA-11738:

The assumption you are making is that the SEVERITY is somehow instantaneously known throughout
the cluster and will be applied promptly and uniformly. In a large cluster, a severity change
will take a while to propagate via the existing gossip mechanism.

bq.  "don't read from this node unless you have to"

Using SEVERITY to indicate this state seems like the wrong mechanism. At a minimum,
it could be a different state in the gossip metadata. Won't correctly calculated latencies
tell a node enough to avoid a given peer? If you really need a node to not be bothered by
any peers, why not just disable gossip? Peers will then mark it down via the {{FailureDetector}}.
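
To put a rough number on the propagation delay mentioned above, here is a minimal sketch. It assumes an idealized model in which the number of nodes that know a new value doubles every gossip round; real gossip picks peers at random and is slower, so this is only a lower bound. The class and method names are hypothetical and are not part of Cassandra.

```java
// Idealized lower bound on gossip propagation, NOT Cassandra's Gossiper logic.
// Assumption: every informed node informs exactly one new node per round.
class GossipRoundsSketch {
    // Returns the minimum number of rounds until all clusterSize nodes are informed.
    static int bestCaseRounds(int clusterSize) {
        int rounds = 0;
        long informed = 1; // the node whose severity changed
        while (informed < clusterSize) {
            informed *= 2; // best case: the informed set doubles
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            System.out.println(n + " nodes: at least " + bestCaseRounds(n) + " rounds");
        }
    }
}
```

With one gossip round per second, even this best case leaves a window of several seconds in a large cluster during which peers still act on the old severity value.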

> Re-think the use of Severity in the DynamicEndpointSnitch calculation
> ---------------------------------------------------------------------
>                 Key: CASSANDRA-11738
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jeremiah Jordan
>             Fix For: 3.x
> CASSANDRA-11737 was opened to allow completely disabling the use of severity in the
> DynamicEndpointSnitch calculation, but that is a pretty big hammer.  There is probably
> something we can do to better use the score.
> The issue seems to be that severity is given equal weight with latency in the current
> code, and also that severity is based only on disk IO.  If you have a node that is CPU bound
> on something (say, catching up on LCS compactions because of bootstrap/repair/replace), the
> IO wait can be low, but the latency to the node is high.
> Some ideas I had are:
> 1. Allowing a yaml parameter to tune how much impact the severity score has in the calculation.
> 2. Taking CPU load into account as well as IO wait (this would probably help in the cases
> where I have seen things go sideways).
> 3. Moving the -D from CASSANDRA-11737 to a yaml-level setting.
> 4. Going back to relying on latency alone and getting rid of severity altogether.  Now that
> we have rapid read protection, maybe just using latency is enough, as it can help where the
> predictive nature of IO wait would have been useful.
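
Idea 1 above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the actual DynamicEndpointSnitch code: the score is taken to be latency normalized by the slowest peer plus a severity term, lower scores are preferred, and `severityWeight` stands in for a hypothetical yaml setting that does not exist in Cassandra. The latency and severity values are made up for illustration.

```java
// Simplified sketch of a weighted severity term; all names are hypothetical.
class SeverityWeightSketch {
    // severityWeight = 1.0 gives severity equal weight with latency;
    // severityWeight = 0.0 ignores severity entirely (latency only).
    static double score(double medianLatencyMs, double maxLatencyMs,
                        double severity, double severityWeight) {
        // Normalize latency against the slowest peer so it falls in [0, 1].
        double normalizedLatency = maxLatencyMs > 0 ? medianLatencyMs / maxLatencyMs : 0.0;
        return normalizedLatency + severityWeight * severity;
    }

    public static void main(String[] args) {
        double maxLatencyMs = 40.0;
        for (double weight : new double[] {1.0, 0.2}) {
            // CPU-bound node: high latency, but low IO wait so severity is 0.
            double cpuBound = score(40.0, maxLatencyMs, 0.0, weight);
            // IO-busy node: low latency, but a high reported severity.
            double ioBusy = score(10.0, maxLatencyMs, 1.0, weight);
            System.out.println("weight=" + weight
                    + " cpuBound=" + cpuBound + " ioBusy=" + ioBusy);
        }
    }
}
```

With equal weighting, the IO-busy node scores worse than the CPU-bound node even though its measured latency is four times lower; with a smaller weight, the ordering follows latency again.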

This message was sent by Atlassian JIRA
