kafka-jira mailing list archives

From "Jiangjie Qin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (KAFKA-5678) When the broker graceful shutdown occurs, the producer side sends timeout.
Date Tue, 01 Aug 2017 22:35:02 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109910#comment-16109910
] 

Jiangjie Qin edited comment on KAFKA-5678 at 8/1/17 10:34 PM:
--------------------------------------------------------------

That's right. I was under the impression that KAFKA-4444 had been merged, but apparently it
hasn't. If that is the case, the LeaderAndIsrRequests sent during controlled shutdown are not
batched, which may make controlled shutdown slow. That in turn could cause request timeouts on
the client side.
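
To illustrate why batching matters here, below is a minimal sketch. It is not Kafka's controller
code; it only counts requests under two assumptions: without batching, one request is sent per
(partition, replica broker) pair, and with batching, one request is sent per distinct broker.
The class name, method names and the sequential cost model are all hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of controller-to-broker requests during controlled shutdown. Not Kafka code. */
final class LeaderAndIsrBatchingSketch {

    /** Unbatched (assumed): one request per (partition, replica broker) pair. */
    static int unbatchedRequestCount(Map<String, List<Integer>> replicasByPartition) {
        int count = 0;
        for (List<Integer> replicas : replicasByPartition.values()) {
            count += replicas.size();
        }
        return count;
    }

    /** Batched (assumed): one request per distinct broker, carrying all of its partitions. */
    static int batchedRequestCount(Map<String, List<Integer>> replicasByPartition) {
        Set<Integer> brokers = new HashSet<>();
        for (List<Integer> replicas : replicasByPartition.values()) {
            brokers.addAll(replicas);
        }
        return brokers.size();
    }

    /** Hypothetical cost model: requests are sent one after another, each costing perRequestMs. */
    static long estimatedMillis(int requestCount, long perRequestMs) {
        return requestCount * perRequestMs;
    }
}
```

Under this model the unbatched count grows with the number of partitions on the broker being shut
down, while the batched count is bounded by the number of brokers.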

That said, I am not sure the correct way to fix this is to let the broker return an error
response regardless of whether the leader migration related to the ProduceRequest has finished.
In general, the Controller should be the source of truth for leadership, so the broker
should not claim on its own that it is not the leader of a partition. If we return NOT_LEADER_FOR_PARTITION
whenever the broker is shutting down and does not own one of the partitions in the ProduceRequest,
clients would likely keep sending ProduceRequests to the broker only to have them rejected
almost immediately. Compared with a request timeout, that sounds more confusing and frustrating.
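
As a rough back-of-the-envelope sketch of that concern (not Kafka code): assume the client retries
against the same broker until leadership has actually moved, that each attempt is rejected after
rejectLatencyMs, and that the client waits retryBackoffMs (the role played by the producer's
retry.backoff.ms setting) between attempts. The class, method and parameter names are hypothetical.

```java
/** Back-of-the-envelope estimate of rejected produce attempts. Not Kafka code. */
final class RejectedRetryEstimate {

    /**
     * Number of attempts rejected while leadership has not yet moved.
     * Attempts are assumed to start at t = 0, cycle, 2 * cycle, ... where
     * cycle = retryBackoffMs + rejectLatencyMs, and to be rejected while t < migrationMs.
     */
    static long rejectedAttempts(long migrationMs, long retryBackoffMs, long rejectLatencyMs) {
        long cycleMs = retryBackoffMs + rejectLatencyMs;
        if (cycleMs <= 0) {
            throw new IllegalArgumentException("cycle time must be positive");
        }
        // Ceiling division: counts every attempt whose start time is before migrationMs.
        return (migrationMs + cycleMs - 1) / cycleMs;
    }
}
```

For example, with an assumed 5000 ms migration, a 100 ms backoff and a 5 ms rejection latency,
rejectedAttempts(5000, 100, 5) evaluates to 48 rejected attempts for a single batch, where the
current behaviour would produce a single timeout.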

Personally I prefer fixing KAFKA-4444 and potentially KAFKA-4453 so that the state transition
would be quicker. 


> When the broker graceful shutdown occurs, the producer side sends timeout.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-5678
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5678
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.9.0.0, 0.10.0.0, 0.11.0.0
>            Reporter: tuyang
>
> Test environment:
> 1. Kafka version: 0.9.0.1
> 2. Cluster with 3 brokers, with broker ids A, B and C.
> 3. Topic with 6 partitions and 2 replicas each, with 2 leader partitions on each broker.
> The problem can be reproduced as follows.
> 1. We send messages as quickly as possible with acks=-1.
> 2. Suppose partition p0's leader is on broker A and we gracefully shut down broker A, but a
> message is sent to p0 before the new leader is elected. The message is appended to the leader
> replica successfully, but if the follower replica does not fetch it quickly enough, the
> shutting-down broker creates a DelayedProduce for the request, which waits for completion
> for up to request.timeout.ms.
> 3. Because of the ControlledShutdown request from broker A, the leader of p0 is re-elected
> and the replica on broker A becomes a follower before the broker has completely shut down.
> The DelayedProduce is then never triggered to complete and only finishes when it expires.
> 4. If broker A takes too long to shut down, the producer gets a response only after
> request.timeout.ms, which increases producer send latency when we restart brokers one by one.
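
The sequence in the quoted report can be sketched as a toy state machine. This is not Kafka's
delayed-operation implementation; it only models the behaviour as the reporter describes it,
assuming that a pending produce is acknowledged only while the broker is still leader and
otherwise ends when request.timeout.ms elapses. All class and method names are hypothetical.

```java
/** Toy model of a pending acks=-1 produce on a broker that is shutting down. Not Kafka code. */
final class DelayedProduceSketch {
    enum Outcome { PENDING, ACKED, TIMED_OUT }

    private final long requiredOffset;  // offset the follower must reach before acknowledging
    private final long deadlineMs;      // creation time plus request.timeout.ms
    private boolean isLeader = true;    // whether this broker still leads the partition
    private Outcome outcome = Outcome.PENDING;

    DelayedProduceSketch(long requiredOffset, long createdMs, long requestTimeoutMs) {
        this.requiredOffset = requiredOffset;
        this.deadlineMs = createdMs + requestTimeoutMs;
    }

    /** Step 2: a follower fetch advances the high watermark while this broker is leader. */
    void onFollowerFetch(long highWatermark) {
        if (outcome == Outcome.PENDING && isLeader && highWatermark >= requiredOffset) {
            outcome = Outcome.ACKED;
        }
    }

    /** Step 3: leadership moves away; nothing completes the pending operation any more. */
    void onBecomeFollower() {
        isLeader = false;
    }

    /** Step 4: the only remaining exit is expiry at the deadline. */
    void onTick(long nowMs) {
        if (outcome == Outcome.PENDING && nowMs >= deadlineMs) {
            outcome = Outcome.TIMED_OUT;
        }
    }

    Outcome outcome() {
        return outcome;
    }
}
```

In this model, calling onBecomeFollower() before the follower catches up leaves the operation
PENDING until onTick reaches the deadline, which corresponds to the extra send latency the
reporter observes.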



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
