flink-issues mailing list archives

From uce <...@git.apache.org>
Subject [GitHub] flink pull request: [FLINK-3131] Expose checkpoint metrics
Date Tue, 15 Dec 2015 00:08:06 GMT
GitHub user uce opened a pull request:


    [FLINK-3131] Expose checkpoint metrics

    - Adds `long getStateSize()` to `StateHandle` and `KvStateSnapshot`. Everything except
test classes and `LazyDbKvState` implements this. `LazyDbKvState` could implement it correctly,
but the state is currently serialized lazily, which means that the state size is not known
(it is currently reported as 0) when the state handle is created.
    - Adds simple statistics tracking to the checkpoint coordinator. This is not using the
accumulators, because I wanted more fine-grained control. I think we can expand the system
internal accumulators to accommodate these use cases better. It is also possible to
retrofit this onto the accumulators, if you want to.
    - Adds the following web runtime monitor handlers:
      * `/jobs/:jobid/checkpoints` for completed checkpoint statistics for the job with the
      * `/jobs/:jobid/vertices/:vertexid/checkpoints` for per operator statistics including
    - Adds the web frontend HTML/JavaScript (screenshots below).
    This feature can be disabled via `jobmanager.web.checkpoints.disable`. I think this is
good practice, because it is attached to one of the most critical parts of the system.
    The maximum history size (see screenshot) for job level statistics can be configured via
`jobmanager.web.checkpoints.history`. Current default is 10. Maybe a little too high?
    - **Checkpoints Tab** (Overview and Operators): 
    ![screen shot 2015-12-15 at 00 45 41](https://cloud.githubusercontent.com/assets/1756620/11797953/e4f17c84-a2c6-11e5-86b1-040a4e1bff12.png)
    - **History** (configurable):
     ![screen shot 2015-12-15 at 00 45 51](https://cloud.githubusercontent.com/assets/1756620/11797957/f2f87940-a2c6-11e5-82ce-5c5fcf8b1ca1.png)
    - **Subtasks**: 
    ![screen shot 2015-12-15 at 00 46 08](https://cloud.githubusercontent.com/assets/1756620/11797963/0d105fd2-a2c7-11e5-9a90-458bd0b7fdc4.png)
    - **Terminated job**: 
    ![screen shot 2015-12-15 at 00 46 44](https://cloud.githubusercontent.com/assets/1756620/11797969/1b0999a0-a2c7-11e5-9826-723f12e997d9.png)
    Jobs without checkpoints just show `No checkpoints` currently.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 3131-checkpoint_metrics

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1453

commit aa12f3c7bb6ac43b91d5926087d7c181958c95cb
Author: Ufuk Celebi <uce@apache.org>
Date:   2015-12-14T18:40:10Z

    [FLINK-3131] [contrib, runtime, streaming-java] Add long getStateSize() to StateHandle and KvStateSnapshot
    In order to report the state sizes, we need to expose them. All currently
    available state backends know the state size. Only the LazyDbKvState does
    not expose it at the moment, because it serializes the data lazily. This can be
    changed in a follow-up fix.
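
As a rough illustration of what exposing the state size enables, the sketch below sums the sizes reported by a set of handles. It is not the code from this commit: `SizedStateHandle` and `totalStateSize` are hypothetical names, only the `long getStateSize()` signature comes from the description above, and treating the value as a byte count is an assumption.

```java
import java.util.Arrays;
import java.util.List;

public class StateSizeSketch {

    /** Hypothetical stand-in for a handle that exposes its state size. */
    public interface SizedStateHandle {
        /** Reported state size (assumed to be bytes); 0 if not known. */
        long getStateSize();
    }

    /** Sums the reported sizes of all handles belonging to one checkpoint. */
    public static long totalStateSize(List<? extends SizedStateHandle> handles) {
        long total = 0;
        for (SizedStateHandle handle : handles) {
            total += handle.getStateSize();
        }
        return total;
    }

    public static void main(String[] args) {
        SizedStateHandle eager = () -> 1024L; // size known when the handle is created
        SizedStateHandle lazy = () -> 0L;     // lazily serialized state reports 0
        System.out.println(totalStateSize(Arrays.asList(eager, lazy))); // prints 1024
    }
}
```

A handle that reports 0 because its size is unknown makes the aggregated total a lower bound on the real size.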

commit 2dae2a8ee98ca08cba4925f15110f1d9de2c1831
Author: Ufuk Celebi <uce@apache.org>
Date:   2015-12-14T19:12:59Z

    [FLINK-3131] [core, runtime] Add checkpoint statistics tracker
    Adds a simple tracker of checkpoint statistics.
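
One way to picture a tracker with a bounded history is sketched below. This is a minimal illustration under stated assumptions, not the tracker added in this commit: the class, its methods, and the fields of `CheckpointStats` are all hypothetical, and the only detail taken from the pull request is that the job-level history has a configurable maximum size.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CheckpointHistorySketch {

    /** Hypothetical per-checkpoint record; the fields are illustrative only. */
    public static final class CheckpointStats {
        public final long checkpointId;
        public final long stateSize;

        public CheckpointStats(long checkpointId, long stateSize) {
            this.checkpointId = checkpointId;
            this.stateSize = stateSize;
        }
    }

    private final int maxHistorySize;
    private final Deque<CheckpointStats> history = new ArrayDeque<>();

    public CheckpointHistorySketch(int maxHistorySize) {
        if (maxHistorySize <= 0) {
            throw new IllegalArgumentException("maxHistorySize must be positive");
        }
        this.maxHistorySize = maxHistorySize;
    }

    /** Records a completed checkpoint, evicting the oldest entry when full. */
    public synchronized void onCompletedCheckpoint(CheckpointStats stats) {
        if (history.size() == maxHistorySize) {
            history.removeFirst();
        }
        history.addLast(stats);
    }

    /** Returns a copy of the retained history, oldest entry first. */
    public synchronized List<CheckpointStats> getHistory() {
        return new ArrayList<>(history);
    }

    public static void main(String[] args) {
        CheckpointHistorySketch tracker = new CheckpointHistorySketch(2);
        tracker.onCompletedCheckpoint(new CheckpointStats(1L, 100L));
        tracker.onCompletedCheckpoint(new CheckpointStats(2L, 200L));
        tracker.onCompletedCheckpoint(new CheckpointStats(3L, 300L));
        System.out.println(tracker.getHistory().size());              // prints 2
        System.out.println(tracker.getHistory().get(0).checkpointId); // prints 2
    }
}
```

Capping the history keeps memory use constant regardless of how long the job runs, at the cost of losing older entries.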

commit 53feb2a1a008f08218d05b91af4853ad18574fa2
Author: Ufuk Celebi <uce@apache.org>
Date:   2015-12-14T19:13:59Z

    [FLINK-3131] [runtime-web] Add checkpoint statistics handlers

commit 47f89d5d24ae2fb6c314205531d696b985acb508
Author: Ufuk Celebi <uce@apache.org>
Date:   2015-12-14T19:48:03Z

    [FLINK-3131] [runtime-web] Add checkpoint statistics to web frontend


If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
