flink-user mailing list archives

From "Zhijiang" <wangzhijiang...@aliyun.com>
Subject Re: Backpressure and 99th percentile latency
Date Fri, 06 Mar 2020 04:54:42 GMT
Hi Felipe,

Let me try to answer your questions below.

> I understand that I am tracking latency every 10 seconds for each physical instance operator.
> Is that right?

Generally, yes. Latency markers are emitted by the sources and flow through all the intermediate
operators until they reach the sinks. The interval controls how often each source emits a marker.

> The backpressure goes away but the 99th percentile latency is still the same. Why? Does
> it have no relation with each other?

The latency can be influenced by the buffer flush timeout, network transport, load, etc.
Under backpressure, a large amount of in-flight data accumulates on the network wire, so
a latency marker has to queue behind that data before it is transported, which can add a
noticeable delay. The source may not even be able to emit the latency marker on time, because
no buffers are available temporarily.

The backpressure going away does not mean that there are no accumulated buffers left on the
network wire, only that they no longer reach the level that counts as backpressure. So the
latency marker still has to queue behind the accumulated buffers on the wire, and it can take
some time to completely digest the previously accumulated buffers before the latency relaxes.
I guess that is your case. You can monitor the "inputQueueLength" and "outputQueueLength"
metrics to confirm this. So the answer is yes, the latency is related to backpressure, but the
change may only become visible after some delay.
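
To make the reasoning above concrete, here is a minimal sketch of a toy FIFO queue model in
Python. It is not Flink code: the function names and all numbers (a backlog of 20000 buffers,
990 buffers/s arriving, 1000 buffers/s consumed) are hypothetical and only illustrate the
effect. A marker that enters the queue waits behind everything already queued, and the backlog
shrinks only at the difference between the consume rate and the arrival rate.

```python
def simulate_marker_latency(backlog, arrival_rate, service_rate, seconds):
    """Toy FIFO model: estimated queuing delay (in seconds) of a marker
    entering the queue at the start of each second."""
    delays = []
    queue = float(backlog)
    for _ in range(seconds):
        # A marker entering now waits behind everything already queued.
        delays.append(queue / service_rate)
        # Per second, the queue grows by arrivals and shrinks by what is consumed.
        queue = max(0.0, queue + arrival_rate - service_rate)
    return delays


def drain_time(backlog, arrival_rate, service_rate):
    """Seconds until the backlog is empty; infinite if the consumer never catches up."""
    if service_rate <= arrival_rate:
        return float("inf")
    return backlog / (service_rate - arrival_rate)


# Hypothetical numbers: the consumer is only 1% faster than the producer.
delays = simulate_marker_latency(20000, 990, 1000, 2001)
print(delays[0], delays[1000], delays[2000])
print(drain_time(20000, 990, 1000))
```

In this model the consumer is faster than the producer, so nothing is blocked any more, yet
the marker delay falls only gradually while the old backlog drains. The closer the two rates
are, the longer that takes.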

> In the end I left the experiment running for more than 2 hours and only after about 1.5
> hours the 99th percentile latency got down to milliseconds. Is that normal?

I think that is normal, for the reasons mentioned above. Once the backpressure is gone and the
accumulated buffers in the network stack are completely drained, the latency should go down to
milliseconds.

Best,
Zhijiang
------------------------------------------------------------------
From:Felipe Gutierrez <felipe.o.gutierrez@gmail.com>
Send Time:2020 Mar. 6 (Fri.) 05:04
To:user <user@flink.apache.org>
Subject:Backpressure and 99th percentile latency

Hi,

I am a bit confused about the topic of tracking latency in Flink [1]. It says that latency
tracking measures Flink's network stack, but that application code latencies can also
influence it. For instance, say I am using metrics.latency.granularity: operator (default)
and setLatencyTrackingInterval(10000). I understand that I am tracking latency every 10 seconds
for each physical instance operator. Is that right?

In my application, I am tracking the latency of all aggregators. Under a high workload, when
I can see backpressure in the Flink UI, the 99th percentile latencies are 13, 25, 21, and
25 seconds. Then I set my aggregator to have a larger window. The backpressure goes away but
the 99th percentile latency is still the same. Why? Does it have no relation with each other?

In the end I left the experiment running for more than 2 hours and only after about 1.5 hours
the 99th percentile latency got down to milliseconds. Is that normal? Please see the figure
attached.

[1] https://flink.apache.org/2019/07/23/flink-network-stack-2.html#latency-tracking

Thanks,
Felipe
--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez

-- https://felipeogutierrez.blogspot.com
