cassandra-commits mailing list archives

From "Dikang Gu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-14252) Use zero as default score in DynamicEndpointSnitch
Date Wed, 21 Feb 2018 23:48:00 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dikang Gu updated CASSANDRA-14252:
----------------------------------
    Description: 
The problem I want to solve is that, in our deployment, one slow but alive data node
can slow down the whole cluster and even cause our requests to time out.

We are using DynamicEndpointSnitch with badness_threshold 0.1. I expect the DynamicEndpointSnitch
to switch to sortByProximityWithScore if the local data node's latency is too high.

I added some debug logging and found that, in a lot of cases, the score for the remote data
node was not populated, so the fallback to sortByProximityWithScore never happened. That is
why a single slow data node can cause huge problems for the whole cluster.

In this jira, I'd like to use zero as the default score, so that we get a chance to try the
remote data node if the local one is slow.
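
To illustrate the reasoning above, here is a minimal Java sketch of the badness check as this ticket describes it. It is a simplified model and not the Cassandra source: the class BadnessSketch, the method shouldResortByScore, the string endpoint names, and the defaultScore parameter are all hypothetical. It assumes that an endpoint without a score is skipped when no default is given, and that the score-based ordering is chosen when any score in proximity order exceeds the positionally matching sorted score by more than the badness threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class BadnessSketch
{
    /**
     * Hypothetical, simplified model of the badness check (not Cassandra code).
     * Returns true when the proximity ordering looks bad enough that the caller
     * should fall back to ordering by score.
     *
     * @param proximityOrder   endpoints already sorted by proximity (local first)
     * @param scores           known latency scores; some endpoints may be missing
     * @param badnessThreshold e.g. 0.1
     * @param defaultScore     score used for missing endpoints, or null to skip them
     */
    static boolean shouldResortByScore(List<String> proximityOrder,
                                       Map<String, Double> scores,
                                       double badnessThreshold,
                                       Double defaultScore)
    {
        List<Double> inProximityOrder = new ArrayList<>();
        for (String endpoint : proximityOrder)
        {
            Double score = scores.get(endpoint);
            if (score == null)
                score = defaultScore;   // proposed behaviour when defaultScore is 0.0
            if (score == null)
                continue;               // assumed old behaviour: endpoint is ignored
            inProximityOrder.add(score);
        }

        // Compare each score in proximity order with the best achievable
        // score at the same position.
        List<Double> sorted = new ArrayList<>(inProximityOrder);
        Collections.sort(sorted);
        Iterator<Double> best = sorted.iterator();
        for (Double actual : inProximityOrder)
        {
            if (actual > best.next() * (1.0 + badnessThreshold))
                return true;
        }
        return false;
    }

    public static void main(String[] args)
    {
        // Only the slow local node has a score, as in the "[1.0]" log lines.
        Map<String, Double> scores = new HashMap<>();
        scores.put("ip1", 1.0);
        List<String> order = Arrays.asList("ip1", "ip2");

        System.out.println(shouldResortByScore(order, scores, 0.1, null)); // false
        System.out.println(shouldResortByScore(order, scores, 0.1, 0.0));  // true
    }
}
```

In this model, when only the slow local node has a score, there is nothing to compare it against and the check never fires. With zero as the default for the missing remote score, the local score of 1.0 exceeds the best score of 0.0 and the fallback is taken.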

I tested it in our test cluster, and it significantly improved client latency in the
single slow data node case.

I flagged this as a Bug because it has caused problems for our use cases multiple times.

==== logs ====

_2018-02-21_23:08:57.54145 WARN 23:08:57 [RPC-Thread:978]: sortByProximityWithBadness: after
sorting by proximity, addresses order change to [ip1, ip2], with scores [1.0]_
_2018-02-21_23:08:57.54319 WARN 23:08:57 [RPC-Thread:967]: sortByProximityWithBadness: after
sorting by proximity, addresses order change to [ip1, ip2], with scores [0.0]_
_2018-02-21_23:08:57.55111 WARN 23:08:57 [RPC-Thread:453]: sortByProximityWithBadness: after
sorting by proximity, addresses order change to [ip1, ip2], with scores [1.0]_
_2018-02-21_23:08:57.55687 WARN 23:08:57 [RPC-Thread:753]: sortByProximityWithBadness: after
sorting by proximity, addresses order change to [ip1, ip2], with scores [1.0]_

  was:
The problem I want to solve is that, in our deployment, one slow but alive data node
can slow down the whole cluster and even cause our requests to time out.

We are using DynamicEndpointSnitch with badness_threshold 0.1. I expect the DynamicEndpointSnitch
to switch to sortByProximityWithScore if the local data node's latency is too high.

I added some debug logging and found that, in a lot of cases, the score for the remote data
node was not populated, so the fallback to sortByProximityWithScore never happened. That is
why a single slow data node can cause huge problems for the whole cluster.

In this jira, I'd like to use zero as the default score, so that we get a chance to try the
remote data node if the local one is slow.

I tested it in our test cluster, and it significantly improved client latency in the
single slow data node case.

I flagged this as a Bug because it has caused problems for our use cases multiple times.

==== logs ====

_2018-02-21_23:08:57.54145 WARN 23:08:57 [RPC-Thread:978]: sortByProximityWithBadness: after
sorting by proximity, addresses order change to [/2401:db00:30:5113:face:0:2d:0, /2401:db00:1030:90fb:face:0:5:0],
with scores [1.0]_
_2018-02-21_23:08:57.54319 WARN 23:08:57 [RPC-Thread:967]: sortByProximityWithBadness: after
sorting by proximity, addresses order change to [/2401:db00:30:510d:face:0:5:0, /2401:db00:1030:a119:face:0:b:0],
with scores [0.0]_
_2018-02-21_23:08:57.55111 WARN 23:08:57 [RPC-Thread:453]: sortByProximityWithBadness: after
sorting by proximity, addresses order change to [/2401:db00:30:5113:face:0:2d:0, /2401:db00:1030:a119:face:0:b:0],
with scores [1.0]_
_2018-02-21_23:08:57.55687 WARN 23:08:57 [RPC-Thread:753]: sortByProximityWithBadness: after
sorting by proximity, addresses order change to [/2401:db00:30:5113:face:0:2d:0, /2401:db00:1030:90fb:face:0:5:0],
with scores [1.0]_

> Use zero as default score in DynamicEndpointSnitch
> --------------------------------------------------
>
>                 Key: CASSANDRA-14252
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14252
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Dikang Gu
>            Assignee: Dikang Gu
>            Priority: Major
>             Fix For: 3.11.x



