I am interested in determining the total time to recover for a Flink
application after experiencing a partial failure. Let's assume a
pipeline consisting of Kafka -> Flink -> Kafka with
exactly-once guarantees enabled.
According to the metrics documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/metrics.html),
one of the metrics that can be gathered is recoveryTime.
However, as far as I can tell this is only the time taken for the
system to go from an inconsistent state back into a consistent
state, i.e. restarting the job. Is there any way of measuring the
amount of time taken from the point when the failure occurred until
the point when the system has caught up to the last message that was
processed before the outage?
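
To make the quantity I am after concrete, here is a rough Python sketch. It does not use any Flink or Kafka API; the function name catch_up_time, the sampled (timestamp, offset) pairs, and the numbers are all hypothetical. It assumes I could periodically sample the offset the job has processed and that I know the failure time and the last offset processed before the outage.

```python
from typing import Iterable, Optional, Tuple


def catch_up_time(
    failure_ts: float,
    pre_failure_offset: int,
    samples: Iterable[Tuple[float, int]],
) -> Optional[float]:
    """Seconds from the failure until the job is back at the offset it
    had reached before the outage, or None if it never gets there.

    samples: hypothetical (timestamp_seconds, processed_offset) pairs,
    e.g. collected by polling some offset/lag metric.
    """
    for ts, offset in sorted(samples):
        # Only look at samples taken after the failure; after a restore
        # the offset is expected to rewind and then climb back up.
        if ts > failure_ts and offset >= pre_failure_offset:
            return ts - failure_ts
    return None


# Made-up numbers: failure at t=100s, offset 5000 reached before it.
samples = [
    (95.0, 4900),   # before the failure
    (105.0, 4500),  # job restarted from the last checkpoint
    (110.0, 4700),  # reprocessing
    (120.0, 5000),  # caught up to the pre-failure position
    (130.0, 5400),
]
print(catch_up_time(100.0, 5000, samples))  # prints 20.0
```

In this made-up example the restart itself is over by t=105s, but the number I am interested in is the 20 seconds until the pre-failure position is reached again.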
Thank you very much in advance!