flink-issues mailing list archives

From "Steven Zhen Wu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-8043) change fullRestarts (for fine grained recovery) from gauge to counter
Date Sun, 24 Dec 2017 20:11:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Zhen Wu updated FLINK-8043:
----------------------------------
    Description: 
Fine-grained recovery publishes fullRestarts as a gauge, which is not suitable for
threshold-based alerting. Usually we would alert on a condition like "fullRestarts > 0
happens 10 times in the last 15 minutes".

In comparison, "task_failures" is published as a counter.
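
To illustrate why the metric type matters for this kind of rule, here is a minimal
sketch in plain Python. It does not use Flink's metrics API. It rests on two assumptions
that the issue does not spell out: the gauge reports the cumulative number of full restarts
at every reporting interval, and the metrics backend reports a counter as the increment
since the previous report. The function names are hypothetical and exist only for this
example.

```python
from collections import deque


def publish_as_gauge(restarts_per_interval):
    """Gauge-style reporting (assumed): each report is the cumulative total so far."""
    total = 0
    reports = []
    for n in restarts_per_interval:
        total += n
        reports.append(total)
    return reports


def publish_as_counter(restarts_per_interval):
    """Counter-style reporting (assumed): each report is the increment since the last one."""
    return list(restarts_per_interval)


def alert_fires(reports, window, min_hits):
    """Return True if 'value > 0' held in at least min_hits of some `window` consecutive reports."""
    recent = deque(maxlen=window)  # sliding window over the most recent reports
    for value in reports:
        recent.append(value > 0)
        if sum(recent) >= min_hits:
            return True
    return False


# One full restart in the first minute, then 14 quiet minutes (one report per minute).
events = [1] + [0] * 14
gauge_reports = publish_as_gauge(events)
counter_reports = publish_as_counter(events)

print(alert_fires(gauge_reports, window=15, min_hits=10))    # True: the gauge stays at 1
print(alert_fires(counter_reports, window=15, min_hits=10))  # False: only one interval saw a restart
```

Under these assumptions, a single restart keeps the gauge above zero in every later
interval, so the rule "fullRestarts > 0 happens 10 times in the last 15 minutes" fires even
though the job restarted only once. With counter-style reporting, the same rule fires only
when restarts actually occur in 10 of the 15 intervals.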

  was: When fine-grained recovery fails (e.g. because there are not enough taskmanager slots
when a replacement taskmanager node doesn't come back in time), Flink will revert to a full
job restart. In this case, it should also increment the "job restart" metric.

        Summary: change fullRestarts (for fine grained recovery) from gauge to counter  (was:
increment job restart metric when fine grained recovery reverted to full job restart)

> change fullRestarts (for fine grained recovery) from gauge to counter
> ---------------------------------------------------------------------
>
>                 Key: FLINK-8043
>                 URL: https://issues.apache.org/jira/browse/FLINK-8043
>             Project: Flink
>          Issue Type: Bug
>          Components: ResourceManager
>    Affects Versions: 1.3.2
>            Reporter: Steven Zhen Wu
>
> Fine-grained recovery publishes fullRestarts as a gauge, which is not suitable for
> threshold-based alerting. Usually we would alert on a condition like "fullRestarts > 0
> happens 10 times in the last 15 minutes".
> In comparison, "task_failures" is published as a counter.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
