zookeeper-dev mailing list archives

From jtuple <...@git.apache.org>
Subject [GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Date Fri, 03 Aug 2018 18:34:33 GMT
Github user jtuple commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    I'm happy to add tests for `AvgMinMaxCounter` and `SimpleCounter`, and I'll go ahead and update
this PR today to include them for completeness, since it's an easy change to make while we discuss
any other reworking.
    
    We actually landed a version of #440 internally at Facebook and have been running
Dropwizard-based percentile counters in production for a few months now. We build on the same
`ServerMetrics` design as this PR and have an `AvgMinMaxPercentileCounter` that wraps a Dropwizard
`Histogram` with a custom uniform-sampling reservoir implementation that supports resetting
the metric/reservoir. This keeps it consistent with all the other metric types, which are
resettable.
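
    To illustrate the idea, here is a minimal, dependency-free sketch of a resettable uniform-sampling reservoir (Dropwizard's own `UniformReservoir` has no reset, which is why a custom implementation is wrapped instead). The class and method names here are illustrative, not the actual internal API:

    ```java
    import java.util.Arrays;
    import java.util.Random;

    // Sketch of a uniform-sampling reservoir (Vitter's Algorithm R) that can be
    // reset, so percentile metrics behave like ZooKeeper's other resettable metrics.
    class ResettableUniformReservoir {
        private static final int SIZE = 1024;
        private final Random random = new Random();
        private final long[] values = new long[SIZE];
        private long count = 0;

        synchronized void update(long value) {
            count++;
            if (count <= SIZE) {
                values[(int) count - 1] = value;
            } else {
                // Keep each new value with probability SIZE/count so the sample
                // stays uniform over everything seen since the last reset.
                long r = (long) (random.nextDouble() * count);
                if (r < SIZE) {
                    values[(int) r] = value;
                }
            }
        }

        synchronized void reset() {
            count = 0;
            Arrays.fill(values, 0L);
        }

        // q in (0, 1]; returns 0 when empty.
        synchronized long percentile(double q) {
            int n = (int) Math.min(count, SIZE);
            if (n == 0) {
                return 0;
            }
            long[] snapshot = Arrays.copyOf(values, n);
            Arrays.sort(snapshot);
            int idx = (int) Math.ceil(q * n) - 1;
            return snapshot[Math.max(idx, 0)];
        }
    }
    ```

    A wrapper metric can then expose avg/min/max plus `percentile(0.5)`, `percentile(0.99)`, etc., and delegate `reset()` to the reservoir.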
    
    In practice, the memory overhead of the Dropwizard histogram is much higher than that of the
flat `AvgMinMaxCounter` metric, so we only converted about 20 of our 100 metrics to
`AvgMinMaxPercentileCounter` -- mostly the latency metrics and the metrics that track time spent
in the various queues and stages of the request pipeline.
    
    We also have `AvgMinMaxCounterSet` and `AvgMinMaxPercentileCounterSet` metric types that
aggregate metrics across a dynamic set of entities; we were planning to submit these in future
PRs that build on this one. Example uses include tracking learner queue size/time and ack
latency on the leader for each quorum peer.
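
    A rough sketch of what such a counter set might look like -- one avg/min/max counter per entity (e.g. per quorum peer), created on demand. Again, names and API are illustrative only:

    ```java
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of a metric that aggregates avg/min/max per entity in a dynamic set,
    // e.g. per-quorum-peer ack latency on the leader.
    class AvgMinMaxCounterSet {
        static class Stats {
            long count;
            long total;
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;

            synchronized void add(long v) {
                count++;
                total += v;
                min = Math.min(min, v);
                max = Math.max(max, v);
            }

            synchronized double avg() {
                return count == 0 ? 0 : (double) total / count;
            }
        }

        private final Map<String, Stats> counters = new ConcurrentHashMap<>();

        void add(String entity, long value) {
            // Entities (peers, learners, ...) appear dynamically; create on first use.
            counters.computeIfAbsent(entity, k -> new Stats()).add(value);
        }

        Stats get(String entity) {
            return counters.get(entity);
        }

        void reset() {
            counters.clear();
        }
    }
    ```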
    
    ---
    
    In any case, I'm happy to rework this PR with guidance, and I have the bandwidth to do so if
we can come to consensus on a design. My biggest concern is that pretty much all of the remaining
internal features at Facebook that we want to upstream are blocked on this one, since we
aggressively add metrics whenever we add new features (so that we can monitor/track them in
production, etc.). The sooner we can land something, the sooner we can rebase on top of whatever
the agreed approach is and unblock our other contributions.
    
    I'm open to any design that the community is happy with; however, I think there are a few
things that are important to maintain:
    
    1. Any solution should make it super easy to add new metrics and use them throughout the
codebase. We've learned the hard way that lowering the barrier to entry is the only way to
ensure new metrics actually get added alongside new features.
    
    2. It's useful to have a `Metric`-type interface that allows us to enforce certain requirements.
As an example, ZooKeeper metrics have historically been resettable, which isn't necessarily
true of other metric libraries. Having an adapter class allowed us to wrap the Dropwizard
histograms and add a fast reset capability, so they behave like all our other metrics.
    
    3. Even if we export to external systems, the ability for ZooKeeper to maintain internal
metrics/aggregation is still important. There should be a low barrier to entry for getting
visibility into the system from day one, with minimal config/setup.
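
    Point 2 above can be sketched roughly as follows -- a common interface that enforces resettability, and an adapter that gives a histogram with no reset of its own a `reset()` by swapping out its backing instance. The `ThirdPartyHistogram` class here is a stand-in for a Dropwizard `Histogram`, and all names are hypothetical:

    ```java
    // Common contract every ZooKeeper-style metric would satisfy in this sketch.
    interface Metric {
        void add(long value);
        void reset();
    }

    // Stand-in for a third-party histogram that has no reset() of its own.
    class ThirdPartyHistogram {
        private long count;
        private long sum;

        void update(long v) {
            count++;
            sum += v;
        }

        long getCount() {
            return count;
        }
    }

    // Adapter: satisfies the Metric contract by replacing the wrapped histogram
    // on reset(), giving a fast reset without touching the library's internals.
    class HistogramAdapter implements Metric {
        private volatile ThirdPartyHistogram histogram = new ThirdPartyHistogram();

        public void add(long value) {
            histogram.update(value);
        }

        public void reset() {
            histogram = new ThirdPartyHistogram();
        }

        long count() {
            return histogram.getCount();
        }
    }
    ```

    The adapter is what lets non-resettable library types slot in next to the existing resettable counters behind one interface.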

