cassandra-commits mailing list archives

From "Ian Barfield (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning
Date Mon, 06 Jan 2014 19:36:51 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863309#comment-13863309 ]

Ian Barfield commented on CASSANDRA-6465:
-----------------------------------------

I believe the purpose of the time penalty was to detect problematic nodes more quickly. If a node
were suddenly suffering severe issues, that wouldn't be reflected in its latency metric until the
current outstanding queries resolved. That might take until the maximum timeout, which can be
arbitrarily long and in many cases is a lot longer than you'd like. By using timeDelay, the snitch
can penalize problem nodes almost immediately, since the queries do not have to time out first.
That said, it has numerous flaws, both conceptually and in its implementation.

I was working on this problem a couple of weeks ago, but have been distracted since, so I might
not be able to give the best summary. Here are a couple of issues off the top of my head, though:
- if the time delay values are low, then high jitter throws the scores way off. It isn't unreasonable
to expect situations where the time delay shifts semi-randomly between 0 and 1 ms. That says very
little about whether a node is a suitable target, but it can cause a drastic difference in scores
if there is no slow node to anchor them (see the toy calculation after this list).
- if the node response periods aren't low (say they average around 50 ms), then the time delay
values are by definition highly random, since the score could be calculated at any point along
the 0 to 50 ms window.
- it has a lot of complex interactions outside of its original scope of detecting bad nodes
- when calculating scores, if there is no lastReceived value for a node (e.g. the node has just
been added to the cluster), the logic defaults to using the current time (essentially a delay of
0, i.e. maximum 'good'). You might instead take the view that an unproven, cache-cold node would
be a bad selection.
- sensitive to local noise. Each time the score is calculated, the timePenalty is calculated
fresh. Since there is no concept of persistence or scope, events that corrupt the scoring
process are extra harmful, e.g. GC, CPU load / thread scheduling, and concurrency shenanigans
occurring between the lastReceived.get() and System.currentTimeMillis() calls.
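
To make the first bullet concrete, here is a toy calculation that ignores the latency component
entirely and only compares the time-delay term against the badness threshold. The numbers are
made up, and the 100 ms divisor assumes the default dynamic_snitch_update_interval_in_ms:

{code:java}
public class JitterToyExample
{
    public static void main(String[] args)
    {
        double updateIntervalMs = 100;            // default dynamic_snitch_update_interval_in_ms

        // Two healthy replicas whose time delays jitter semi-randomly in the 0-1 ms range.
        double scoreA = 0.1 / updateIntervalMs;   // 0.001
        double scoreB = 0.9 / updateIntervalMs;   // 0.009

        // Relative to the "fastest" node, B looks ~800% worse, dwarfing the 10%
        // dynamic_snitch_badness_threshold, even though both replied within the last millisecond.
        System.out.printf("%.0f%% worse than the best%n", (scoreB / scoreA - 1) * 100);
    }
}
{code}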

Some of these issues are somewhat alleviated by the switch to using nanos, and I've been tempted
to backport that for this class, at least for testing, but this logic fails in complex ways.
I think at some point I was able to confirm some wildly fluctuating values of the score
subcomponents (specifically timePenalty) by checking the mbeans, working under the assumption
that timePenalty was likely the only component producing well-rounded scores: if at least one
node has a time delay well above the update interval, its penalty gets cut off at
UPDATE_INTERVAL_IN_MS, which as a divisor makes for nicely formed floating point numbers (the
capped value over the interval is exactly 1.0).

There are also a lot of issues with the other score components, and some of the overall logic,
but... some other time. Apologies if I've gotten something quite wrong; I've never really
used Cassandra.

> DES scores fluctuate too much for cache pinning
> -----------------------------------------------
>
>                 Key: CASSANDRA-6465
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6465
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 1.2.11, 2 DC cluster
>            Reporter: Chris Burroughs
>            Assignee: Tyler Hobbs
>            Priority: Minor
>              Labels: gossip
>             Fix For: 2.0.5
>
>         Attachments: des-score-graph.png, des.sample.15min.csv, get-scores.py
>
>
> To quote the conf:
> {noformat}
> # if set greater than zero and read_repair_chance is < 1.0, this will allow
> # 'pinning' of replicas to hosts in order to increase cache capacity.
> # The badness threshold will control how much worse the pinned host has to be
> # before the dynamic snitch will prefer other replicas over it.  This is
> # expressed as a double which represents a percentage.  Thus, a value of
> # 0.2 means Cassandra would continue to prefer the static snitch values
> # until the pinned host was 20% worse than the fastest.
> dynamic_snitch_badness_threshold: 0.1
> {noformat}
> An assumption of this feature is that scores will vary by less than dynamic_snitch_badness_threshold
> during normal operations.  Attached is the result of polling a node for the scores of 6 different
> endpoints at 1 Hz for 15 minutes.  The endpoints to sample were chosen with `nodetool getendpoints`
> for a row that is known to get reads.  The node was acting as a coordinator for a few hundred
> req/second, so it should have sufficient data to work with.  Other traces on a second cluster
> have produced similar results.
>  * The scores vary by far more than I would expect, as shown by the difficulty of seeing
> anything useful in that graph.
>  * The difference between the best and next-best score is usually > 10% (the default dynamic_snitch_badness_threshold).
> Neither ClientRequest nor ColumnFamily metrics showed wild changes during the data-gathering
> period.
> Attachments:
>  * jython script cobbled together to gather the data (based on work on the mailing list
> from Maki Watanabe a while back)
>  * csv of DES scores for 6 endpoints, polled about once a second
>  * Attempt at making a graph



