kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "James Cheng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-3944) After the broker restart, fetchers stopped due to a delayed controlled shutdown message
Date Wed, 01 Feb 2017 02:34:51 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-3944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847898#comment-15847898
] 

James Cheng commented on KAFKA-3944:
------------------------------------

Is this related to KAFKA-1120?

> After the broker restart, fetchers stopped due to a delayed controlled shutdown message
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3944
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3944
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Maysam Yabandeh
>            Priority: Minor
>              Labels: reliability
>
> The symptom is that cluster reports under-replicated blocks and some replicas do not
seem to catch up ever. It turns out that the corresponding fetchers in those brokers were
stopped shortly after the broker's restart. The broker had stopped the fetcher upon receiving
stop-replica request from the controller. The controller had issued those request upon processing
controlled shutdown request form the same broker. However those requests were all sent before
the broker restart but the controller processed them after. Here is the timeline:
> # broker sends controlled shutdown message to controller
> # the process fails and the broker proceeds with an unclean shutdown
> # the broker is restated
> # the controller processes the perviously sent controlled shutdown messages
> # the controller sends stop replica messages to the broker
> # the broker shuts down the fetchers, while it has no intent to shut down again
> # this leads to under-replicated blocks
> Example from logs:
> {code}
> broker19.com:/var/log/kafka$ grep "Retrying controlled shutdow\|unclean shutdown" server.log.2016-07-07.2

> 2016-07-07 15:58:10,818 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled
shutdown after the previous attempt failed...
> 2016-07-07 15:58:45,887 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled
shutdown after the previous attempt failed...
> 2016-07-07 15:59:20,927 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled
shutdown after the previous attempt failed...
> 2016-07-07 15:59:20,929 WARN server.KafkaServer: [Kafka Server 19], Proceeding to do
an unclean shutdown as all the controlled shutdown attempts failed
> broker19.com:/var/log/kafka$ head -1 server.log.2016-07-07.3
> 2016-07-07 16:00:23,191 INFO server.KafkaConfig: KafkaConfig values: 
> {code}
> {code}
> broker13.com:/var/log/kafka$ grep "Shutting down broker 19" controller.log.2016-07-07.1

> 2016-07-07 15:57:35,822 INFO controller.KafkaController: [Controller 13]: Shutting down
broker 19
> 2016-07-07 16:02:45,526 INFO controller.KafkaController: [Controller 13]: Shutting down
broker 19
> 2016-07-07 16:05:42,432 INFO controller.KafkaController: [Controller 13]: Shutting down
broker 19
> {code}
> which resulted into many stop replica request to broker 19:
> {code}
> broker13.com:/var/log/kafka$ grep "The stop replica request (delete = false) sent to
broker 19 is" controller.log.2016-07-07.1 | tail -1
> 2016-07-07 16:06:02,374 DEBUG controller.ControllerBrokerRequestBatch: The stop replica
request (delete = false) sent to broker 19 is [Topic=topic-xyz,Partition=6,Replica=19]
> {code}
> broker 19 processes them AFTER its restart:
> {code}
> broker19.com:/var/log/kafka$ grep "handling stop replica (delete=false) for partition
.topic-xzy,3." state-change.log.2016-07-07.2 
> 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 handling stop replica (delete=false)
for partition [topic-xzy,3]
> 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 finished handling stop replica
(delete=false) for partition [topic-xyz,3]
> 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 handling stop replica (delete=false)
for partition [topic-xyz,3]
> 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 finished handling stop replica
(delete=false) for partition [topic-xyz,3]
> {code}
> and removes the fetchers:
> {code}
> broker19.com:/var/log/kafka$ grep "Removed fetcher.*topic-xyz.3" server.log.2016-07-07.3
| tail -2
> 2016-07-07 16:06:00,154 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on
broker 19] Removed fetcher for partitions [topic-xyz,3]
> 2016-07-07 16:06:00,155 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on
broker 19] Removed fetcher for partitions [topic-xyz,3]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message