flink-user mailing list archives

From "Tzu-Li (Gordon) Tai" <tzuli...@apache.org>
Subject Re: Kafka Consumer fetch-size/rate and Producer queue timeout
Date Wed, 08 Nov 2017 10:09:24 GMT
Hi Ashish,

From your description I do not yet have much of an idea of what may be happening.
However, some of your observations seem reasonable. I’ll go through them one by one:

I did try to modify request.timeout.ms, linger.ms etc. to help with the issue if it were caused
by a sudden burst of data or something along those lines. However, what it did was cause the app
to build up back pressure and become slower and slower until that timeout was reached.

If the client is experiencing trouble in writing outstanding records to Kafka, and the timeout
is increased, then I think increased back pressure is indeed the expected behavior.
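
For reference, those producer settings are passed through the Properties object handed to the Flink Kafka producer. Below is a minimal sketch, assuming the 0.10 connector and a simple string schema; the topic name and broker list are placeholders:

import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ProducerConfigSketch {
    public static FlinkKafkaProducer010<String> buildProducer() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder broker list
        // Raising this only postpones the "records expiring in queue" error while
        // back pressure keeps building; a lower value surfaces the failure sooner.
        props.setProperty("request.timeout.ms", "30000");
        // linger.ms trades a little extra latency for larger batches per request.
        props.setProperty("linger.ms", "5");
        return new FlinkKafkaProducer010<>("output-topic", new SimpleStringSchema(), props); // placeholder topic
    }
}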

I noticed that the consumer fetch-rate drops tremendously while fetch-size grows exponentially
BEFORE the producer actually starts to show higher response-times and lower rates.

A drop in fetch-rate and growth in fetch-size in the Flink Kafka consumer should be a natural
consequence of backpressure in the job.
The fetch loop in the consumer is blocked temporarily when backpressure is propagated
from downstream operators, resulting in longer fetch intervals and larger batches on each
fetch (given that the event rate remains constant).
Therefore, I think the root cause still lies on the producer side.
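
If it helps to cross-check the consumer side, the fetch behaviour behind those metrics is controlled by the standard Kafka consumer properties passed to the Flink Kafka consumer. A minimal sketch, again assuming the 0.10 connector; topic, group id, and brokers are placeholders:

import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ConsumerConfigSketch {
    public static FlinkKafkaConsumer010<String> buildConsumer() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder broker list
        props.setProperty("group.id", "my-streaming-app");                   // placeholder group id
        // Upper bound on data returned per partition per fetch; under back pressure the
        // poll interval grows, so each fetch fills up toward this cap (larger fetch-size).
        props.setProperty("max.partition.fetch.bytes", "1048576");
        return new FlinkKafkaConsumer010<>("input-topic", new SimpleStringSchema(), props); // placeholder topic
    }
}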

Would you happen to have any logs that might show useful information on the producer side?
I think we might have a better chance of finding out what is going on by digging there.
Also, which Flink version & Kafka version are you using?

Cheers,
Gordon
On 5 November 2017 at 11:24:49 PM, Ashish Pokharel (ashishpok@yahoo.com) wrote:

All,  

I am starting to notice a strange behavior in a particular streaming app. I initially thought
it was a Producer issue as I was seeing timeout exceptions (records expiring in the queue). I did
try to modify request.timeout.ms, linger.ms etc. to help with the issue if it were caused by
a sudden burst of data or something along those lines. However, what it did was cause the app
to build up back pressure and become slower and slower until that timeout was reached. With
lower timeouts, the app would actually raise an exception and recover faster. I can tell it is not
related to connectivity as other apps are running just fine around the same time frame, connected
to the same brokers (we have at least 10 streaming apps connected to the same list of brokers) from
the same data nodes. We have enabled the Graphite Reporter in all of our applications. After deep
diving into some of the consumer and producer stats, I noticed that the consumer fetch-rate drops
tremendously while fetch-size grows exponentially BEFORE the producer actually starts to show
higher response-times and lower rates. Eventually, I noticed connection resets start to occur
and connection counts go up momentarily, after which things get back to normal. Data producer
rates remain constant around that timeframe - we have a Logstash producer sending data over.
We checked both Logstash and Kafka metrics and they seem to show the same pattern (sort
of a sine wave) throughout.  

It seems to point to a Kafka issue (perhaps some tuning between the Flink app and Kafka), but I wanted
to check with the experts before I start knocking down the Kafka Admin’s doors. Is there anything
else I can look into? There are quite a few default stats in Graphite, but those were the ones
that made the most sense.  

Thanks, Ashish