flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Timo Walther <twal...@apache.org>
Subject Re: Question: Determining Total Recovery Time
Date Tue, 04 Feb 2020 11:31:57 GMT
Hi Morgan,

as far as I know this is not possible mostly because measuring "till the 
point when the system catches up to the last message" is very 
pipeline/connector dependent. Some sources might need to read from the 
very beginning, some just continue from the latest checkpointed offset.

Measure things like that (e.g. for experiments) might require collecting 
own metrics as part of your pipeline definition.


On 03.02.20 12:20, Morgan Geldenhuys wrote:
> Community,
> I am interested in determining the total time to recover for a Flink 
> application after experiencing a partial failure. Let's assume a 
> pipeline consisting of Kafka -> Flink -> Kafka with Exactly-Once 
> guarantees enabled.
> Taking a look at the documentation 
> (https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/metrics.html),

> one of the metrics which can be gathered is /recoveryTime/. However, as 
> far as I can tell this is only the time taken for the system to go from 
> an inconsistent state back into a consistent state, i.e. restarting the 
> job. Is there any way of measuring the amount of time taken from the 
> point when the failure occurred till the point when the system catches 
> up to the last message that was processed before the outage?
> Thank you very much in advance!
> Regards,
> Morgan.

View raw message