cassandra-commits mailing list archives

From "Chris Lohfink (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11752) histograms/metrics in 2.2 do not appear recency biased
Date Thu, 12 May 2016 14:23:13 GMT


Chris Lohfink commented on CASSANDRA-11752:

Well, I will preface this by saying I had nothing to do with any of the decisions, nor do I have
any real say in this. I am just an observer who has been impacted by the changes a bit.

bq. If I understand your proposal correctly,
Apparently not; I'll elaborate on a few of the ideas I listed above.

- Change the reservoir to use HdrHistogram.
-- This is how most of the metrics community resolves this (on ML). It has the
HdrHistogramResetOnSnapshotReservoir (can be implemented in EH, i.e. EstimatedHistogram, too) that
would essentially do the deltas. Unfortunately, when you read one attribute at a time this will
cause issues, per Coda Hale's comment when people asked for the feature in metrics: "Definitely not.
Concurrency and reset operations don't play nicely."
-- A lot of the push for using non-lossy histograms vs. the random sampling reservoirs (pre
2.2) comes up every time someone sees one of Gil Tene's talks for the first time, so this would
make a lot of people happy.
- Add an exponential decay to the EH.
-- We can add forward decay to the values of the EH buckets, which is actually pretty trivial to
implement (I'd be willing to give this a shot; it also sounds fun).
-- This would give the same "recent" view as the ExpDecayingReservoir without the randomness
that loses outliers.
- Expose the clear operation on the MBean. If you clear the histogram after reading it, that
would really give you what you're looking for.
-- Pre 2.2 this is how it worked for cfhistograms and such: there were two ways to read each
histogram, one that cleared and one that did not.
-- Same comment as for the ResetOnSnapshotReservoir above: this is easy to manage when doing it
programmatically, but it fails with things like dumb JMX readers.
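
To make the forward-decay idea concrete, here is a minimal sketch. It is in Python for brevity rather than Cassandra's Java, and it is not the EstimatedHistogram API: the class name `DecayingBucketHistogram`, the `half_life_s` and `landmark` parameters, and the `rescale` method are all hypothetical names chosen for illustration. Each update adds a weight of exp(alpha * (t - landmark)) to its bucket instead of 1, so newer samples dominate the percentiles while outliers still land in a deterministic bucket.

```python
import math
from bisect import bisect_left


class DecayingBucketHistogram:
    """Fixed-bucket histogram whose bucket counts are weighted by forward decay.

    Illustrative only; the names and parameters here are assumptions.
    """

    def __init__(self, bucket_bounds, half_life_s=60.0, landmark=0.0):
        # bucket_bounds: sorted upper bounds; one extra overflow bucket follows.
        self.bounds = list(bucket_bounds)
        self.weights = [0.0] * (len(self.bounds) + 1)
        self.alpha = math.log(2) / half_life_s  # decay rate from the half-life
        self.landmark = landmark

    def update(self, value, now):
        # Forward decay: the weight grows with arrival time past the landmark,
        # so nothing already stored has to be touched on each update.
        idx = bisect_left(self.bounds, value)
        self.weights[idx] += math.exp(self.alpha * (now - self.landmark))

    def rescale(self, now):
        # Move the landmark forward so weights stay within floating-point range.
        factor = math.exp(-self.alpha * (now - self.landmark))
        self.weights = [w * factor for w in self.weights]
        self.landmark = now

    def percentile(self, q):
        # Quantiles depend only on relative weights, so no normalisation by
        # exp(alpha * (now - landmark)) is needed.
        total = sum(self.weights)
        if total == 0.0:
            return 0.0
        target = q * total
        running = 0.0
        for idx, w in enumerate(self.weights):
            running += w
            if running >= target:
                return self.bounds[idx] if idx < len(self.bounds) else math.inf
        return math.inf


# 1000 slow requests at t=0, then 1000 fast requests 10 half-lives later.
hist = DecayingBucketHistogram([1, 10, 100, 1000], half_life_s=10.0)
for _ in range(1000):
    hist.update(500, now=0.0)
for _ in range(1000):
    hist.update(5, now=100.0)
p99_decayed = hist.percentile(0.99)
```

With plain counts the p99 here would sit in the 1000 bucket forever; with the decayed weights the old slow samples carry roughly 1/1024 of the weight of the new ones, so the p99 falls to the 10 bucket. A real implementation would also need to call something like `rescale` periodically and handle concurrent writers, which this sketch ignores.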

bq. If I understand your proposal correctly, you are saying it would be better to drop all
of that, much more powerful (and mathematically sound!) if we did an out of band export and
merge of all of the histograms and create a heatmap. This would provide better insight into
the distribution of values (by showing the full distribution instead of a handful of percentiles)
and allow for cluster wide aggregation. This could be further augmented by using hue and saturation
to call out latencies for individual nodes or column families. I think that sounds fantastic,
but that is very much not where the industry is today. Maybe Circonus can do that, but graphite
definitely can't.

For what it's worth, that's more or less what OpsCenter does. It still uses percentiles rather than
a heatmap because they are easier to conceptualize, but it generates the percentiles from merged
histograms instead of _averaging_ the percentile values (which apparently makes some people very,
very angry). That's not helpful here, but we shouldn't necessarily sacrifice the more accurate
mechanism either. There were very loud complaints about how latencies were reported before, so I
was pretty glad to see the change in 2.2. I propose we provide both.
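
A small sketch of why merging histograms and averaging percentiles give different answers. This is only an illustration in Python with made-up bucket bounds and counts; `percentile` and `merge` are hypothetical helpers, not OpsCenter or Cassandra code, and it assumes every node uses the same bucket boundaries.

```python
def percentile(bounds, counts, q):
    """Return the upper bound of the bucket containing the q-th quantile."""
    target = q * sum(counts)
    running = 0
    for bound, count in zip(bounds, counts):
        running += count
        if running >= target:
            return bound
    return bounds[-1]


def merge(histograms):
    """Sum per-bucket counts across nodes (all nodes share the same buckets)."""
    return [sum(column) for column in zip(*histograms)]


bounds = [1, 10, 100, 1000]   # bucket upper bounds, e.g. latency in ms
node_a = [900, 100, 0, 0]     # busy node: 1000 fast requests
node_b = [0, 0, 5, 5]         # nearly idle node: 10 slow requests

p99_a = percentile(bounds, node_a, 0.99)
p99_b = percentile(bounds, node_b, 0.99)
averaged_p99 = (p99_a + p99_b) / 2
merged_p99 = percentile(bounds, merge([node_a, node_b]), 0.99)
```

Here the averaged p99 comes out at 505 while the merged p99 is 10: averaging gives the ten slow requests on the idle node the same say as the thousand fast ones, whereas the merged histogram weighs every request equally.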

> histograms/metrics in 2.2 do not appear recency biased
> ------------------------------------------------------
>                 Key: CASSANDRA-11752
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Chris Burroughs
>              Labels: metrics
>         Attachments: boost-metrics.png, c-jconsole-comparison.png, c-metrics.png, default-histogram.png
> In addition to upgrading to metrics3, CASSANDRA-5657 switched to using a custom histogram
implementation.  After upgrading to Cassandra 2.2 histograms/timer metrics are now suspiciously
flat.  To be useful for graphing and alerting metrics need to be biased towards recent events.
> I have attached images that I think illustrate this.
>  * The first two are a comparison between latency observed by a C* 2.2 (us) cluster showing
very flat lines and a client (using metrics 2.2.0, ms) showing server performance problems.
 We can't rule out with total certainty that something else isn't the cause (that's why we
measure from both the client & server) but they very rarely disagree.
>  * The 3rd image compares jconsole viewing of metrics on a 2.2 and 2.1 cluster over several
minutes.  Not a single digit changed on the 2.2 cluster.

This message was sent by Atlassian JIRA
