storm-dev mailing list archives

From srdo <...@git.apache.org>
Subject [GitHub] storm pull request #2200: STORM-2616: Documentation for built in metrics
Date Mon, 10 Jul 2017 20:52:42 GMT
Github user srdo commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2200#discussion_r126523893
  
    --- Diff: docs/Metrics.md ---
    @@ -125,3 +126,193 @@ The [builtin metrics]({{page.git-blob-base}}/storm-client/src/jvm/org/apache/sto
     
     [BuiltinMetricsUtil.java]({{page.git-blob-base}}/storm-client/src/jvm/org/apache/storm/daemon/metrics/BuiltinMetricsUtil.java)
sets up data structures for the built-in metrics, and facade methods that the other framework
components can use to update them. The metrics themselves are calculated in the calling code
-- see for example [`ackSpoutMsg`]({{page.git-blob-base}}/storm-client/src/jvm/org/apache/storm/executor/Executor.java).
     
    +#### Reporting Rate
    +
    +The rate at which built-in metrics are reported is configurable through the `topology.builtin.metrics.bucket.size.secs`
config.  If you set this too low it can overload the consumers,
    +and some metrics consumers expect metrics to show up at a fixed rate or the numbers could
be off, so please use caution when modifying this.
    +
    +
    +#### Tuple Counting Metrics
    +
    +There are several different metrics related to counting what a bolt or spout does to
a tuple. These include things like emitting, transferring, acking, and failing of tuples.
    +
    +In general all of these tuple count metrics are randomly sub-sampled unless otherwise
stated.  This means that the counts you see both on the UI and from the built-in metrics are
not necessarily exact.  In fact by default we sample only 5% of the events and estimate the
total number of events from that.  The sampling percentage is configurable per topology through
the `topology.stats.sample.rate` config.  Setting it to 1.0 will make the counts exact, but
be aware that the more events we sample the slower your topology will run (as the metrics
are counted on the critical path).  This is why we have a 5% sample rate as the default.
    +
    +The tuple counting metrics are generally reported as maps unless explicitly stated otherwise.
 They break down each count for finer grained reporting.
    +The keys to these maps fall into two categories: `"${stream_name}"` or `"${upstream_component}:${stream_name}"`.
 The former is used for all spout metrics and for outgoing bolt metrics (`__emit-count` and
`__transfer-count`).  The latter is used for bolt metrics that deal with incoming tuples.
    +
    +So for a word count topology the count bolt might show something like the following for
the `__ack-count` metric:
    +
    +```
    +{
    +    "split:default": 80080
    +}
    +```
    +
    +But the spout would show something more like the following for the same metric:
    +
    +```
    +{
    +    "default": 12500
    +}
    +```
    +
    +
    +##### `__ack-count`
    +
    +For bolts it is the number of incoming tuples that had the `ack` method called on them.
 For spouts it is the number of tuples that were fully acked.  If acking is disabled this
metric is still reported, but it is not really meaningful.
    +
    +##### `__fail-count`
    +
    +For bolts this is the number of incoming tuples that had the `fail` method called on
them.  For spouts this is the number of tuples that failed.  It could be because of a tuple
timing out or it could be because a bolt called fail on it.  The two are not separated out.
    +
    +##### `__emit-count`
    +
    +This is the total number of times the `emit` method was called to send a tuple.  This
is the same for both bolts and spouts.
    +
    +##### `__transfer-count`
    +
    +This is the total number of tuples transferred to a downstream bolt/spout for processing.
This number will not always match `__emit-count`.  If nothing is registered to receive a tuple
downstream the number will be 0 even if tuples were emitted.  Similarly, if there are multiple
downstream consumers it may be a multiple of the number emitted.  The grouping also can play
a role if it sends the tuple to multiple instances of a single bolt downstream.
    +
    +##### `__execute-count`
    +
    +This count metric is bolt specific.  It counts the number of times that a bolt's `execute`
method was called.
    +
    +#### Tuple Latency Metrics
    +
    +Similar to the tuple counting metrics, Storm also collects average latency metrics for
bolts and spouts.  These follow the same structure as the bolt/spout maps and are sub-sampled
in the same way as well.  In all cases the latency is measured in milliseconds.
    +
    +##### `__complete-latency`
    +
    +The complete latency is just for spouts.  It is the average amount of time it took for
`ack` or `fail` to be called for a tuple after it was emitted.  If acking is disabled this
metric is likely to be blank or 0 for all values, and should be ignored.
    +
    +##### `__execute-latency`
    +
    +This is just for bolts.  It is the average amount of time that the bolt spent in the
call to the `execute` method.  The longer this gets the fewer tuples a single bolt instance
can process.
    +
    +##### `__process-latency`
    +
    +This is also just for bolts.  It is the average amount of time between when `execute`
was called to start processing a tuple, to when it was acked or failed by the bolt.  If your
bolt is a very simple bolt and the processing is synchronous then `__process-latency` and
`__execute-latency` should be very close to one another, with process latency being slightly
smaller.  If you are doing a join or have asynchronous processing then it may take a while
for a tuple to be acked so the process latency would be higher than the execute latency.
    +
    +##### `__skipped-max-spout-ms`
    +
    +This metric records how much time a spout was idle because more tuples than `topology.max.spout.pending`
were still outstanding.  This is the total time in milliseconds, not the average amount of
time and is not sub-sampled.
    +
    +
    +##### `__skipped-throttle-ms`
    +
    +This metric records how much time a spout was idle because back-pressure indicated that
downstream queues in the topology were too full.  This is the total time in milliseconds,
not the average amount of time and is not sub-sampled.
    +
    +##### `__skipped-inactive-ms`
    +
    +This metric records how much time a spout was idle because the topology was deactivated.
 This is the total time in milliseconds, not the average amount of time and is not sub-sampled.
    +
    +#### Queue Metrics
    +
    +Each bolt or spout instance in a topology has a receive queue and a send queue.  Each
worker also has a queue for sending messages to other workers.  All of these have metrics
that are reported.
    +
    +The receive queue metrics are reported under the `__receive` name and send queue metrics
are reported under the `__sendqueue` name for the given bolt/spout they are a part of.  The metrics
for the queue that sends messages to other workers are under the `__transfer` metric name for
the system bolt (`__system`).
    +
    +They all have the form.
    +
    +```
    +{
    +    "arrival_rate_secs": 1229.1195171893523,
    +    "overflow": 0,
    +    "read_pos": 103445,
    +    "write_pos": 103448,
    +    "sojourn_time_ms": 2.440771591407277,
    +    "capacity": 1024,
    +    "population": 19
    +}
    +```
    +
    +NOTE that in the `__receive` and `__transfer` queues a single entry may hold 1 or more
tuples in it.  For the `__sendqueue` metrics each slot holds a single tuple.  The batching
is an optimization that has been in storm since the beginning, so be careful with how you
interpret the metrics.  In older versions of storm all of the metrics represent slots in the
queue, and not tuples That has been updated so please be careful when trying to compare metrics
between different versions of storm.
    --- End diff --
    
    Wondering if it would be clearer to replace "entry" with "slot" here for consistency.
Is NOTE supposed to be all caps? Shouldn't it say "For the `__sendqueue` queue" instead? Missing
period after "and not tuples". I had a little trouble understanding the last half of this.
Would "In older versions of Storm all of the queue metrics counted slots in the queue. This
has been changed in version x.y.z so the queue metrics now always count tuples. Please be
careful when trying to compare metrics between different versions of Storm" be accurate? If
so I think it's maybe easier to understand.
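
To make the slot-versus-tuple distinction concrete, here is a minimal Python sketch. It is not Storm code: the queue is modeled as a plain list of batches, and `population_in_slots`, `population_in_tuples`, and `receive_queue` are hypothetical names chosen only for illustration.

```python
from typing import List, Sequence


def population_in_slots(queue: Sequence[Sequence[str]]) -> int:
    """Count occupied slots; each slot may hold a batch of one or more tuples."""
    return len(queue)


def population_in_tuples(queue: Sequence[Sequence[str]]) -> int:
    """Count individual tuples across all slots."""
    return sum(len(batch) for batch in queue)


# Hypothetical receive queue: three slots holding batches of 1, 4 and 2 tuples.
receive_queue: List[List[str]] = [["a"], ["b", "c", "d", "e"], ["f", "g"]]

print(population_in_slots(receive_queue))   # 3
print(population_in_tuples(receive_queue))  # 7
```

The same queue state yields a population of 3 when slots are counted and 7 when tuples are counted, which is why numbers from versions that count differently may not be directly comparable.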


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
