flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4410) Report more information about operator checkpoints
Date Fri, 23 Dec 2016 20:39:58 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15773649#comment-15773649 ]

ASF GitHub Bot commented on FLINK-4410:
---------------------------------------

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/3042

    [FLINK-4410] Expose more fine grained checkpoint statistics

    This PR exposes more fine-grained checkpoint statistics. The previous version of the tracking code had a couple of shortcomings:
    - Only completed checkpoints were tracked in the history. You did not see in-progress or failed checkpoints.
    - Only the latest completed checkpoint had more fine-grained stats per operator and subtask. This meant that the statistics of a possibly interesting checkpoint could be updated live while you were looking at them.
    - Many newly tracked statistics, like the checkpoint duration at the operator or the alignment duration, were not exposed.
    
    This PR addresses these issues. For the extended tracking of the life cycle, I decided to add tracking callbacks of all relevant entities like `PendingCheckpointStats`, `CompletedCheckpointStats`, `SubtaskStateStats`, `TaskStateStats`, etc. The life cycle of these objects follows that of their corresponding entities.
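
    The life cycle tracking described above can be pictured with a minimal sketch. It is not the code from this PR: `CheckpointLifecycleSketch`, `CheckpointStats`, `History`, and all method names are hypothetical and chosen only for illustration. The sketch assumes a stats object is created when a checkpoint is triggered and is then moved to a completed or failed state by callbacks, so that the history can show every checkpoint regardless of its status.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical sketch; the names do not mirror Flink's actual classes. */
    public class CheckpointLifecycleSketch {

        enum Status { IN_PROGRESS, COMPLETED, FAILED }

        /** Stats for a single checkpoint, created when the checkpoint is triggered. */
        static final class CheckpointStats {
            final long checkpointId;
            final int expectedSubtasks;
            int acknowledgedSubtasks;
            Status status = Status.IN_PROGRESS;

            CheckpointStats(long checkpointId, int expectedSubtasks) {
                this.checkpointId = checkpointId;
                this.expectedSubtasks = expectedSubtasks;
            }

            /** Callback: one subtask acknowledged this checkpoint. */
            void reportSubtaskAcknowledged() {
                if (status != Status.IN_PROGRESS) {
                    throw new IllegalStateException("Checkpoint is no longer in progress");
                }
                acknowledgedSubtasks++;
                if (acknowledgedSubtasks == expectedSubtasks) {
                    status = Status.COMPLETED;
                }
            }

            /** Callback: the checkpoint failed, for example because the job was cancelled. */
            void reportFailed() {
                if (status == Status.IN_PROGRESS) {
                    status = Status.FAILED;
                }
            }
        }

        /** History that keeps in-progress, completed, and failed checkpoints alike. */
        static final class History {
            private final List<CheckpointStats> entries = new ArrayList<>();

            /** Callback: a new checkpoint was triggered. */
            CheckpointStats reportTriggered(long checkpointId, int expectedSubtasks) {
                CheckpointStats stats = new CheckpointStats(checkpointId, expectedSubtasks);
                entries.add(stats);
                return stats;
            }

            List<CheckpointStats> entries() {
                return entries;
            }
        }

        public static void main(String[] args) {
            History history = new History();

            CheckpointStats first = history.reportTriggered(1L, 2);
            first.reportSubtaskAcknowledged();
            first.reportSubtaskAcknowledged(); // all subtasks acknowledged

            CheckpointStats second = history.reportTriggered(2L, 2);
            second.reportSubtaskAcknowledged();
            second.reportFailed();

            history.reportTriggered(3L, 2); // stays in progress

            for (CheckpointStats stats : history.entries()) {
                System.out.println(stats.checkpointId + " " + stats.status);
            }
        }
    }
    ```

    Because the history holds the stats objects themselves, an in-progress entry changes state in place once its callbacks fire, which is one way a UI could list pending and failed checkpoints next to completed ones.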
    
    Furthermore, this adds new REST API handlers that work with the new tracker, as well as a new layout for displaying the statistics.
    
    ---
    
    Some screenshots:
    
    **Clicking on the Checkpoints Tab**: Sub tabs for overview, history, summary stats, and the config.
    
    ![00-start](https://cloud.githubusercontent.com/assets/1756620/21461971/3fdfb9be-c957-11e6-9f61-62610aa95da4.png)
    
    **Clicking on the History Tab**: Lists recent checkpoints, including in progress ones.
    
    ![01-history](https://cloud.githubusercontent.com/assets/1756620/21461994/657fd0a0-c957-11e6-8d08-0f084e018aca.png)
    
    **Clicking on details for a checkpoint**:
    
    ![02-details](https://cloud.githubusercontent.com/assets/1756620/21462027/ce4577a2-c957-11e6-9851-9d225c3762f4.png)
    
    **After triggering a savepoint**:
    
    ![03-savepoint](https://cloud.githubusercontent.com/assets/1756620/21462031/d6857318-c957-11e6-810b-e6d639b5caaf.png)
    
    **Details for the triggered savepoint**:
    
    ![04-savepoint_details](https://cloud.githubusercontent.com/assets/1756620/21462038/e80c1916-c957-11e6-984c-2447ec877c2d.png)
    
    **Failed checkpoint while cancelling job**:
    
    ![05-failed_checkpoint](https://cloud.githubusercontent.com/assets/1756620/21462049/f9ac90f6-c957-11e6-8e0d-48dba2581378.png)
    
    ![06-failed_checkpoint_details](https://cloud.githubusercontent.com/assets/1756620/21462052/fdd2e068-c957-11e6-9cb6-e4ece5c5dd36.png)
    
    ![07-failed_checkpoint_overview](https://cloud.githubusercontent.com/assets/1756620/21462062/05fd444a-c958-11e6-8fc5-580f4e9e4e18.png)
    
    **Clicking on the config tab**:
    
    ![09-config](https://cloud.githubusercontent.com/assets/1756620/21462067/0d3f6210-c958-11e6-9e1a-0767a8f557a5.png)
    
    **After restoring from the savepoint**:
    
    ![08-restore_from_savepoint](https://cloud.githubusercontent.com/assets/1756620/21462071/1559a97e-c958-11e6-8ce5-b4287408d918.png)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 4410-checkpoint_stats

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3042.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3042
    
----
commit 700ec439ed0e9fb00c52e6e373a5bcccfecce963
Author: Ufuk Celebi <uce@apache.org>
Date:   2016-12-23T19:31:29Z

    [FLINK-4410] [runtime, runtime-web] Remove old checkpoint stats tracker code

commit c3f50c956f281a316a17b390851443c5be3adb6c
Author: Ufuk Celebi <uce@apache.org>
Date:   2016-12-23T19:37:08Z

    [FLINK-4410] [runtime] Rework checkpoint stats tracking

commit 1db53a69829be8472fb74b6b83f0d3638121762f
Author: Ufuk Celebi <uce@apache.org>
Date:   2016-12-23T19:44:12Z

    [FLINK-4410] [runtime-web] Add detailed checkpoint stats handlers

commit d6f6e7d48e05da47e02e8710fca699104bcc5988
Author: Ufuk Celebi <uce@apache.org>
Date:   2016-12-23T19:44:59Z

    [FLINK-4410] [runtime-web] Add new layout for checkpoint stats

commit ab6c597f51c4aeea81dde0f82a3e1e7e72571ad9
Author: Ufuk Celebi <uce@apache.org>
Date:   2016-12-23T19:47:02Z

    [FLINK-4410] [runtime-web] Rebuild JS/HTML files

----


> Report more information about operator checkpoints
> --------------------------------------------------
>
>                 Key: FLINK-4410
>                 URL: https://issues.apache.org/jira/browse/FLINK-4410
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing, Webfrontend
>    Affects Versions: 1.1.2
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>             Fix For: 1.2.0
>
>
> Checkpoint statistics contain the duration of a checkpoint, measured from the CheckpointCoordinator's start to the point when the acknowledge message arrived.
> We should additionally expose
>   - duration of the synchronous part of a checkpoint
>   - duration of the asynchronous part of a checkpoint
>   - number of bytes buffered during the stream alignment phase
>   - duration of the stream alignment phase
> Note: In the case of using *at-least once* semantics, the latter two will always be zero.
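
The four additional values requested in the issue can be sketched as a small per-subtask record. This is an illustration only, not Flink's API: `SubtaskMetricsSketch`, `SubtaskMetrics`, the field names, and the two factory methods are hypothetical. The sketch assumes durations are measured in milliseconds and encodes the note above by fixing both alignment values to zero for at-least-once mode.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; names and units are assumptions, not Flink's API. */
public class SubtaskMetricsSketch {

    /** Per-subtask values for a single checkpoint. */
    static final class SubtaskMetrics {
        final long syncDurationMillis;       // synchronous part of the checkpoint
        final long asyncDurationMillis;      // asynchronous part of the checkpoint
        final long alignmentBufferedBytes;   // bytes buffered during stream alignment
        final long alignmentDurationMillis;  // duration of the stream alignment phase

        private SubtaskMetrics(long sync, long async, long bufferedBytes, long alignment) {
            this.syncDurationMillis = sync;
            this.asyncDurationMillis = async;
            this.alignmentBufferedBytes = bufferedBytes;
            this.alignmentDurationMillis = alignment;
        }

        /** Exactly-once mode: alignment happens, so both alignment values are reported. */
        static SubtaskMetrics exactlyOnce(long sync, long async, long bufferedBytes, long alignment) {
            return new SubtaskMetrics(sync, async, bufferedBytes, alignment);
        }

        /** At-least-once mode: no alignment, so both alignment values are zero. */
        static SubtaskMetrics atLeastOnce(long sync, long async) {
            return new SubtaskMetrics(sync, async, 0L, 0L);
        }
    }

    /** Sum of bytes buffered during alignment across all subtasks. */
    static long totalBufferedBytes(List<SubtaskMetrics> subtasks) {
        long total = 0L;
        for (SubtaskMetrics metrics : subtasks) {
            total += metrics.alignmentBufferedBytes;
        }
        return total;
    }

    /** Longest alignment phase observed across all subtasks. */
    static long maxAlignmentDurationMillis(List<SubtaskMetrics> subtasks) {
        long max = 0L;
        for (SubtaskMetrics metrics : subtasks) {
            max = Math.max(max, metrics.alignmentDurationMillis);
        }
        return max;
    }

    public static void main(String[] args) {
        List<SubtaskMetrics> subtasks = Arrays.asList(
                SubtaskMetrics.exactlyOnce(5, 40, 1024, 12),
                SubtaskMetrics.exactlyOnce(3, 55, 2048, 20));

        System.out.println("buffered bytes: " + totalBufferedBytes(subtasks));
        System.out.println("max alignment ms: " + maxAlignmentDurationMillis(subtasks));
    }
}
```

Keeping the synchronous and asynchronous durations as separate fields, rather than one combined duration, is what would let a UI show which part of a checkpoint dominates for a given subtask.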



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
