flink-user mailing list archives

From: Morgan Geldenhuys <morgan.geldenh...@tu-berlin.de>
Subject: Question: Determining Total Recovery Time
Date: Mon, 03 Feb 2020 11:20:28 GMT

Community,

I am interested in determining the total recovery time of a Flink
application after it experiences a partial failure. Assume a pipeline
consisting of Kafka -> Flink -> Kafka with exactly-once guarantees
enabled.

According to the documentation
(https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/metrics.html),
one of the metrics that can be gathered is /recoveryTime/. As far as I
can tell, however, this only covers the time the system takes to go from
an inconsistent state back to a consistent one, i.e. the time to restart
the job. Is there a way to measure the time from the moment the failure
occurs until the system has caught up to the last message that was
processed before the outage?
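
To make the quantity I am after concrete, here is a rough sketch of how I would compute it offline. It assumes I can collect the failure timestamp and a series of (timestamp, consumer lag) samples from somewhere, and that "caught up" means the lag has returned to its normal pre-failure level. The function name, the sample format and the baseline threshold are all my own assumptions, not anything Flink or Kafka provides.

```python
from typing import Iterable, Optional, Tuple


def total_recovery_time(
    failure_ts: float,
    lag_samples: Iterable[Tuple[float, int]],
    baseline_lag: int,
) -> Optional[float]:
    """Seconds from the failure until lag first returns to the baseline.

    lag_samples holds (timestamp_seconds, consumer_lag) pairs (hypothetical
    format). Returns None if the backlog never builds up or never drains.
    """
    seen_backlog = False
    for ts, lag in sorted(lag_samples):
        if ts < failure_ts:
            continue  # ignore samples taken before the failure
        if lag > baseline_lag:
            seen_backlog = True  # backlog caused by the outage is visible
        elif seen_backlog:
            return ts - failure_ts  # first sample back at normal lag
    return None


# Made-up samples: failure at t=12s, lag is back to normal at t=50s.
samples = [(0, 5), (10, 4), (20, 500), (30, 900), (40, 300), (50, 6), (60, 3)]
print(total_recovery_time(12.0, samples, baseline_lag=10))  # 38.0
```

This would include both the restart time and the catch-up phase, which is the part the restart-related metric does not seem to cover.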

Thank you very much in advance!

Regards,
Morgan.
