kafka-dev mailing list archives

From Tamás Máté <tamas0m...@gmail.com>
Subject Re: [KAFKA-5138] MirrorMaker doesn't exit on send failure occasionally
Date Tue, 15 Aug 2017 22:04:14 GMT
Hey,

I think I have found something.
My guess is that when the pollNoWakeup part of
AbstractCoordinator.maybeLeaveGroup(...) throws an exception, resetGeneration()
is never called and the HeartbeatThread stays in the STABLE state. The broker
is never notified of the consumer's leave request, so it thinks that
everything is all right and keeps responding to the consumer's requests.
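
A minimal sketch of the failure mode guessed at above, not Kafka's actual
code: SketchCoordinator, sendLeaveGroupAndPoll() and the two maybeLeaveGroup
variants are hypothetical stand-ins, and the exception is simulated. It only
models the local state; it does not show whether the broker ever receives the
leave request.

```java
public class LeaveGroupSketch {
    enum MemberState { UNJOINED, STABLE }

    /** Hypothetical stand-in for the coordinator; not Kafka's real class. */
    static class SketchCoordinator {
        MemberState state = MemberState.STABLE;
        private final boolean pollFails;

        SketchCoordinator(boolean pollFails) {
            this.pollFails = pollFails;
        }

        // Stand-in for "send LeaveGroup, then poll"; the failure is simulated.
        void sendLeaveGroupAndPoll() {
            if (pollFails) {
                throw new IllegalStateException("simulated poll failure");
            }
        }

        void resetGeneration() {
            state = MemberState.UNJOINED;
        }

        // Suspected shape: an exception from the poll skips resetGeneration().
        void maybeLeaveGroupSuspected() {
            sendLeaveGroupAndPoll();
            resetGeneration();
        }

        // Guarded shape: local state is reset even if the poll throws.
        void maybeLeaveGroupGuarded() {
            try {
                sendLeaveGroupAndPoll();
            } finally {
                resetGeneration();
            }
        }
    }

    /** Runs one variant against a failing poll and returns the resulting state. */
    static MemberState stateAfterFailingPoll(boolean guarded) {
        SketchCoordinator coordinator = new SketchCoordinator(true);
        try {
            if (guarded) {
                coordinator.maybeLeaveGroupGuarded();
            } else {
                coordinator.maybeLeaveGroupSuspected();
            }
        } catch (IllegalStateException e) {
            // Ignored here so the resulting state can be inspected.
        }
        return coordinator.state;
    }
}
```

With the suspected shape the state stays STABLE after the failure; with the
guarded shape it ends up UNJOINED.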

If this is the case, it seems impossible to fix on the consumer side, except
maybe with a new config parameter (leave retry?).
The other option could be a MirrorMaker fix: for example, when the producer
dies, shoot the consumers in the head.
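
A rough sketch of what the MirrorMaker-side option could look like, under
stated assumptions: StoppableConsumer and ShutdownOnSendFailure are
hypothetical names, not MirrorMaker code. The real KafkaConsumer does offer
wakeup() for interrupting a blocked poll from another thread, and a producer
send callback receives a non-null exception on failure; the sketch mimics
only those two ideas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ProducerFailureSketch {
    /** Hypothetical stand-in for a consumer that can be interrupted. */
    interface StoppableConsumer {
        void wakeup();
    }

    /** Hypothetical helper: the first send failure wakes every consumer once. */
    static class ShutdownOnSendFailure {
        private final AtomicBoolean triggered = new AtomicBoolean(false);
        private final List<StoppableConsumer> consumers;

        ShutdownOnSendFailure(List<StoppableConsumer> consumers) {
            this.consumers = consumers;
        }

        // Meant to be called from the producer's send callback.
        void onSendCompletion(Exception error) {
            // compareAndSet ensures later failures do not repeat the wakeups.
            if (error != null && triggered.compareAndSet(false, true)) {
                for (StoppableConsumer consumer : consumers) {
                    consumer.wakeup();
                }
            }
        }
    }

    /** Counts wakeup() calls after the given number of failed sends. */
    static int wakeupsAfterFailures(int consumerCount, int failures) {
        AtomicInteger wakeups = new AtomicInteger();
        List<StoppableConsumer> consumers = new ArrayList<>();
        for (int i = 0; i < consumerCount; i++) {
            consumers.add(wakeups::incrementAndGet);
        }
        ShutdownOnSendFailure shutdown = new ShutdownOnSendFailure(consumers);
        shutdown.onSendCompletion(null); // a successful send changes nothing
        for (int i = 0; i < failures; i++) {
            shutdown.onSendCompletion(new RuntimeException("simulated send failure"));
        }
        return wakeups.get();
    }
}
```

The one-shot flag is a design choice for the sketch: several in-flight sends
may fail together, and each consumer only needs to be woken once.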

What do you think about these?

I still couldn't repro the issue, though; I will try to do that tomorrow. :)

Best regards,
Tamas

On 15 August 2017 at 14:49, Tamás Máté <tamas0mate@gmail.com> wrote:

> Hi Guys,
>
> I started working on this ticket a little more than a week ago:
> https://issues.apache.org/jira/browse/KAFKA-5138
>
> I could not reproduce it sadly, but from the logs Dustin gave and from the
> code it seems like this might not be just a MirrorMaker issue but a
> consumer one.
>
> My theory is
>  1) MM send failure happens because of heavy load
>  2) MM starts to close its producer
>  3) during MM shutdown the source server starts a consumer rebalance
> (the consumers couldn't respond because of the heavy load)
>  4) heartbeat response gets delayed
>  5) MM producer closed, but MM gets a heartbeat response and resets the
> connection
>  6) because there is a thread left in the JVM, it can't shut down
>  7) MM hangs
>
> Maybe the order is a bit different; I couldn't prove it without a
> reproduction.
>
> I set the following configs to values under 100 ms and then stress-tested the
> source cluster with JMeter.
>  - request.timeout.ms
>  - replica.lag.time.max.ms
>  - session.timeout.ms
>  - group.min.session.timeout.ms
>  - group.max.session.timeout.ms
>  - heartbeat.interval.ms
>
> Could you give me some pointers on how I could reproduce this issue?
>
> Thanks,
> Tamas
>
>
