X-Original-To: apmail-flink-commits-archive@minotaur.apache.org
Delivered-To: apmail-flink-commits-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1C31D1912F for ; Wed, 6 Apr 2016 10:24:36 +0000 (UTC)
Received: (qmail 96899 invoked by uid 500); 6 Apr 2016 10:24:35 -0000
Delivered-To: apmail-flink-commits-archive@flink.apache.org
Received: (qmail 96854 invoked by uid 500); 6 Apr 2016 10:24:35 -0000
Mailing-List: contact commits-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@flink.apache.org
Delivered-To: mailing list commits@flink.apache.org
Received: (qmail 96845 invoked by uid 99); 6 Apr 2016 10:24:35 -0000
Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Apr 2016 10:24:35 +0000
Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id AA999DFC6C; Wed, 6 Apr 2016 10:24:35 +0000 (UTC)
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: uce@apache.org
To: commits@flink.apache.org
X-Mailer: ASF-Git Admin Mailer
Subject: flink git commit: [docs] Add back pressure monitoring page
Date: Wed, 6 Apr 2016 10:24:35 +0000 (UTC)

Repository: flink
Updated Branches:
  refs/heads/release-1.0 cb320041d -> ddc07c10e

[docs] Add back pressure monitoring page

Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/ddc07c10
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/ddc07c10
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/ddc07c10

Branch: refs/heads/release-1.0
Commit: ddc07c10e9e92742666584ec804277e059c01801
Parents: cb32004
Author: Ufuk Celebi
Authored: Wed Apr 6 12:18:44 2016 +0200
Committer: Ufuk Celebi
Committed: Wed Apr 6 12:24:26 2016 +0200

----------------------------------------------------------------------
 docs/internals/back_pressure_monitoring.md      |  83 +++++++++++++++++++
 docs/internals/fig/back_pressure_sampling.png   | Bin 0 -> 17635 bytes
 .../fig/back_pressure_sampling_high.png         | Bin 0 -> 77546 bytes
 .../fig/back_pressure_sampling_in_progress.png  | Bin 0 -> 79112 bytes
 .../internals/fig/back_pressure_sampling_ok.png | Bin 0 -> 79668 bytes
 5 files changed, 83 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flink/blob/ddc07c10/docs/internals/back_pressure_monitoring.md
----------------------------------------------------------------------
diff --git a/docs/internals/back_pressure_monitoring.md b/docs/internals/back_pressure_monitoring.md
new file mode 100644
index 0000000..d272eaf
--- /dev/null
+++ b/docs/internals/back_pressure_monitoring.md
@@ -0,0 +1,83 @@
+---
+title: "Back Pressure Monitoring"
+# Top navigation
+top-nav-group: internals
+top-nav-pos: 9
+---
+
+
+Flink's web interface provides a tab to monitor the back pressure behaviour of running jobs.
+
+* ToC
+{:toc}
+
+## Back Pressure
+
+If you see a **back pressure warning** (e.g. `High`) for a task, this means that it is producing data faster than the downstream operators can consume. Records in your job flow downstream (e.g. from sources to sinks) and back pressure is propagated in the opposite direction, up the stream.
+
+Take a simple `Source -> Sink` job as an example. If you see a warning for `Source`, this means that `Sink` is consuming data slower than `Source` is producing. `Sink` is back pressuring the upstream operator `Source`.
+
+
+## Sampling Threads
+
+Back pressure monitoring works by repeatedly taking stack trace samples of your running tasks. The JobManager triggers repeated calls to `Thread.getStackTrace()` for the tasks of your job.
+
+
+
+
+If the samples show that a task Thread is stuck in a certain internal method call (requesting buffers from the network stack), this indicates that there is back pressure for the task.
+
+By default, the job manager takes 100 stack trace samples, 50 ms apart, for each task in order to determine back pressure. The ratio you see in the web interface tells you how many of these stack traces were stuck in the internal method call, e.g. `0.01` indicates that only 1 in 100 was stuck in that method.
+
+- **OK**: 0 <= Ratio <= 0.10
+- **LOW**: 0.10 < Ratio <= 0.5
+- **HIGH**: 0.5 < Ratio <= 1
+
+In order to not overload the task managers with stack trace samples, the web interface refreshes samples only after 60 seconds.
+
+## Configuration
+
+You can configure the number of samples for the job manager with the following configuration keys:
+
+- `jobmanager.web.backpressure.refresh-interval`: Time after which available stats are considered outdated and need to be refreshed (DEFAULT: 60000, 1 min).
+- `jobmanager.web.backpressure.num-samples`: Number of stack trace samples to take to determine back pressure (DEFAULT: 100).
+- `jobmanager.web.backpressure.delay-between-samples`: Delay between stack trace samples to determine back pressure (DEFAULT: 50, 50 ms).
+
+
+## Example
+
+You can find the *Back Pressure* tab next to the job overview.
+
+### Sampling In Progress
+
+This means that the JobManager triggered a stack trace sample of the running tasks. With the default configuration, this takes about 5 seconds to complete.
+
+Note that by clicking the row, you trigger the sample for all subtasks of this operator.
+
+
+
+### Back Pressure Status
+
+If you see status **OK** for the tasks, there is no indication of back pressure. **HIGH** on the other hand means that the tasks are back pressured.
+
+
+
+
+

http://git-wip-us.apache.org/repos/asf/flink/blob/ddc07c10/docs/internals/fig/back_pressure_sampling.png
----------------------------------------------------------------------
diff --git a/docs/internals/fig/back_pressure_sampling.png b/docs/internals/fig/back_pressure_sampling.png
new file mode 100644
index 0000000..ad6ce2f
Binary files /dev/null and b/docs/internals/fig/back_pressure_sampling.png differ

http://git-wip-us.apache.org/repos/asf/flink/blob/ddc07c10/docs/internals/fig/back_pressure_sampling_high.png
----------------------------------------------------------------------
diff --git a/docs/internals/fig/back_pressure_sampling_high.png b/docs/internals/fig/back_pressure_sampling_high.png
new file mode 100644
index 0000000..15372fd
Binary files /dev/null and b/docs/internals/fig/back_pressure_sampling_high.png differ

http://git-wip-us.apache.org/repos/asf/flink/blob/ddc07c10/docs/internals/fig/back_pressure_sampling_in_progress.png
----------------------------------------------------------------------
diff --git a/docs/internals/fig/back_pressure_sampling_in_progress.png b/docs/internals/fig/back_pressure_sampling_in_progress.png
new file mode 100644
index 0000000..96ec3cd
Binary files /dev/null and b/docs/internals/fig/back_pressure_sampling_in_progress.png differ

http://git-wip-us.apache.org/repos/asf/flink/blob/ddc07c10/docs/internals/fig/back_pressure_sampling_ok.png
----------------------------------------------------------------------
diff --git a/docs/internals/fig/back_pressure_sampling_ok.png b/docs/internals/fig/back_pressure_sampling_ok.png
new file mode 100644
index 0000000..2ca2d51
Binary files /dev/null and b/docs/internals/fig/back_pressure_sampling_ok.png differ
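
Illustration (not part of the commit): the sampling procedure described in the "Sampling Threads" section of the added page can be sketched roughly as below. This is a minimal sketch under stated assumptions, not Flink's actual implementation. The class `BackPressureSketch`, its methods, and the stand-in method name `requestBuffer` are all hypothetical; the page only says that the task thread is stuck in "a certain internal method call" without naming it. Only `Thread.getStackTrace()` and the OK/LOW/HIGH thresholds come from the page itself.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of stack-trace-based back pressure sampling.
// Not Flink's implementation; every name in this class is made up for illustration.
class BackPressureSketch {

    // Stand-in for the internal "request a buffer" call: blocks until released.
    static void requestBuffer(CountDownLatch entered, CountDownLatch release) {
        entered.countDown();              // this frame is now on the calling thread's stack
        try {
            release.await();              // simulate waiting for a free network buffer
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // True if any frame of the trace belongs to the given class and method.
    static boolean isStuckIn(StackTraceElement[] trace, String className, String methodName) {
        for (StackTraceElement frame : trace) {
            if (frame.getClassName().equals(className)
                    && frame.getMethodName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    // Takes numSamples stack traces, delayMillis apart, and returns the
    // fraction of samples in which the thread was inside the given method.
    static double sampleRatio(Thread task, int numSamples, long delayMillis,
                              String className, String methodName)
            throws InterruptedException {
        if (numSamples <= 0) {
            throw new IllegalArgumentException("numSamples must be positive");
        }
        int stuck = 0;
        for (int i = 0; i < numSamples; i++) {
            if (isStuckIn(task.getStackTrace(), className, methodName)) {
                stuck++;
            }
            if (i < numSamples - 1) {
                Thread.sleep(delayMillis);
            }
        }
        return (double) stuck / numSamples;
    }

    // Maps a ratio to the levels listed on the page:
    // OK: 0 <= r <= 0.10, LOW: 0.10 < r <= 0.5, HIGH: 0.5 < r <= 1.
    static String classify(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]");
        }
        if (ratio <= 0.10) {
            return "OK";
        }
        if (ratio <= 0.5) {
            return "LOW";
        }
        return "HIGH";
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread task = new Thread(() -> requestBuffer(entered, release), "task");
        task.start();
        entered.await();                  // the task thread is now inside requestBuffer

        // Small values keep the demo fast; the documented defaults are 100 samples, 50 ms apart.
        double ratio = sampleRatio(task, 10, 5,
                BackPressureSketch.class.getName(), "requestBuffer");
        System.out.println("ratio=" + ratio + " status=" + classify(ratio));

        release.countDown();              // let the task thread finish
        task.join();
    }
}
```

With the documented defaults (100 samples, 50 ms between samples), one sampling round in this sketch would take roughly 100 x 50 ms, or about 5 seconds, which matches the duration the page gives for "Sampling In Progress".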