cassandra-commits mailing list archives

From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8692) Coalesce intra-cluster network messages
Date Sat, 07 Feb 2015 00:01:36 GMT


Benedict commented on CASSANDRA-8692:

This is the kind of discussion it would really help to have over a whiteboard, or failing
that over video chat. I'll do my best to explain my position, but it's verbose, and it seems
I already failed to convey it the first time.

bq. The timing of providing messages to the average doesn't matter since the timestamp is
generated by the submitter not the sending thread

It matters because this moving average is over a fixed number of messages, rather than a fixed
time interval. So the order of provision with respect to querying matters a lot, because a
(non-uniformly-distributed) close clustering of messages will have smaller-than-average deltas
between them. A single batch you process at once is likely to fit the bill. The converse is
also true: the first message in a batch is more likely to be abnormally distant from the prior
messages. So your implementation depends on the assumption that message arrivals are uniform,
just implicitly :)
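To illustrate, here's a minimal sketch (hypothetical names, not the actual patch) of a
16-sample moving average being fed a batch:

{code:java}
import java.util.ArrayDeque;

// Hypothetical sketch, not the actual patch: a moving average over the
// last 16 inter-arrival deltas.
class FixedCountAverage
{
    private final ArrayDeque<Long> deltas = new ArrayDeque<>(16);
    private long lastNanos = -1;

    void record(long timestampNanos)
    {
        if (lastNanos >= 0)
        {
            if (deltas.size() == 16)
                deltas.pollFirst();
            deltas.addLast(timestampNanos - lastNanos);
        }
        lastNanos = timestampNanos;
    }

    double averageDeltaNanos()
    {
        return deltas.stream().mapToLong(Long::longValue).average().orElse(0);
    }
}
{code}

A batch of 16 messages submitted near-simultaneously contributes 15 near-zero deltas,
collapsing the average regardless of the true arrival rate, while the first message of the
next batch contributes one abnormally large delta.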

The main thing is that the non-uniformity of message arrival is highly unlikely to average
out over 16 messages, whereas over a 1s time horizon it is much more likely to, so the
uniformity assumption stands a better chance of producing reasonable calculations. With a
150K/s message arrival rate (estimated from your numbers), the average is calculated over
roughly 100us on average, which is a very short time horizon from which to extract any
predictive power.
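Back-of-envelope (my estimates, not measurements):

{code:java}
public class WindowSpan
{
    public static void main(String[] args)
    {
        double ratePerSecond = 150_000;                      // estimated arrival rate
        double meanDeltaMicros = 1_000_000 / ratePerSecond;  // ~6.7us between messages
        double windowSpanMicros = 16 * meanDeltaMicros;      // ~107us of history
        System.out.printf("mean delta %.1fus; 16-sample window spans %.0fus%n",
                          meanDeltaMicros, windowSpanMicros);
    }
}
{code}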

bq. I am on the fence as to whether something smarter than a moving average counts as scope

I'm still only suggesting a moving average, just one measured over a fixed time horizon instead
of a fixed number of measurements. I'm also suggesting a different strategy for utilising
this calculation, one that uses more of the information at our disposal. The two suggestions
are somewhat orthogonal, and you could implement each change independently. Right now your
algorithm is: average over the most recent 16 messages, and wait as long as the average delta
between them, which we hope gives us one more message to coalesce. Mine is 1) to make the
average calculation more robust to fluctuations in arrival rates (but still decay rapidly);
and 2) to make the decision on whether to coalesce based on the amount of potential "win" we
will get: as the number of messages we have to flush grows, the benefit of waiting declines,
and we also potentially wait longer if there is the expectation we can flush a significant
cluster of messages at once. Thus it should (theoretically) lead to less delay when unhelpful,
and more when helpful.
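To make both halves concrete, a minimal sketch of the shape I have in mind (names and
constants are hypothetical; this is not a patch):

{code:java}
// Hypothetical sketch: (1) a rate estimate that decays over a fixed time
// horizon; (2) a wait bounded by the marginal benefit of one more message.
class CoalesceHeuristic
{
    private static final double HORIZON_NANOS = 1_000_000_000d; // ~1s decay horizon
    private double ratePerNano = 0;   // decayed estimate of message arrival rate
    private long lastUpdateNanos = 0;

    void record(long nowNanos)
    {
        // Exponentially decay the old estimate, then credit this arrival.
        double decay = Math.exp(-(nowNanos - lastUpdateNanos) / HORIZON_NANOS);
        ratePerNano = ratePerNano * decay + 1 / HORIZON_NANOS;
        lastUpdateNanos = nowNanos;
    }

    long maxWaitNanos(int pendingMessages)
    {
        // Expected gap until the next message, from the decayed rate estimate.
        double expectedGapNanos = 1 / Math.max(ratePerNano, 1e-12);
        // The more messages already pending, the smaller the marginal "win"
        // from waiting for one more, so permit a shorter wait.
        double marginalWin = 1.0 / (pendingMessages + 1);
        return (long) Math.min(expectedGapNanos * marginalWin, 100_000); // 100us cap
    }
}
{code}

The steady state of that decayed estimate converges on the true arrival rate, but a burst only
perturbs it in proportion to its share of the horizon, so it decays rapidly without being
dominated by any single batch.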

bq. It seems like trying to do better would help bring down average latency by 100, maybe
200, microseconds when latency is 2+ milliseconds.

Well, my aim isn't so much to bring down latency further as to make the algorithm
understandable to me. It strikes me that its behaviour isn't necessarily emerging from its
design, but from other correlated factors. Take a random data point in your results for
illustration: 400K messages with a coalesce window of 100us. Guessing your RF=5 and CL=ALL
from your other comments, it looks like each node would have 150K/s traffic, which translates
to a ~7us average gap between message arrivals. That means your algorithm should, on average,
only be waiting ~7us to coalesce, so why is it so much faster than maximum coalesce windows
of 6us, 12us and 25us? It must be regularly predicting an average message delay of 100us,
which is not correct, so it's not clear to me what's actually driving its results. My concern,
as a result, is that this might not translate into as much advantage in the field as it does
in the lab, where we produce a very steady load.
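The sanity check, with the numbers above (all estimated, per the guesses already stated):

{code:java}
public class PredictionCheck
{
    public static void main(String[] args)
    {
        double perNodeRate = 150_000;                         // estimated messages/s
        double expectedWaitMicros = 1_000_000 / perNodeRate;  // ~6.7us predicted delay
        // An algorithm tracking the true rate should wait ~7us on average and
        // so behave like a fixed 6-12us window; beating those windows implies
        // it regularly predicts delays near the 100us cap instead.
        System.out.printf("expected average coalesce wait: %.1fus%n", expectedWaitMicros);
    }
}
{code}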

My handwavy suspicion right now is that the non-uniformity of message arrivals leads us into
some kind of phased work completion, where the delta between the phases is large enough to
tip our calculation towards delaying, which steadily leads to phased work provision from the
clients, since they behave synchronously. This might also explain why it takes a few minutes
for the cluster to reach its performance steady state - which doesn't otherwise make much
sense, and which you also highlighted your confusion over.

Either way, fewer unnecessary delays translate into higher throughput, not just reduced
latency. Mostly, though, I want a beast for which I can explain all of the properties and
behaviours.

> Coalesce intra-cluster network messages
> ---------------------------------------
>                 Key: CASSANDRA-8692
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.4
>         Attachments: batching-benchmark.png
> While researching CASSANDRA-8457 we found that it is effective and can be done without
> introducing additional latency at low concurrency/throughput.
> The patch from that was used and found to be useful in a real life scenario so I propose
> we implement this in 2.1 in addition to 3.0.
> The change set is a single file and is small enough to be reviewable.
