Date: Mon, 6 Nov 2017 20:54:00 +0000 (UTC)
From: "AS (JIRA)"
To: jira@kafka.apache.org
Reply-To: jira@kafka.apache.org
Subject: [jira] [Updated] (KAFKA-6178) Broker is listed as only ISR for all partitions it is leader of
archived-at: Mon, 06 Nov 2017 20:54:07 -0000

[ https://issues.apache.org/jira/browse/KAFKA-6178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

AS updated KAFKA-6178:
----------------------
    Description:
We're running a 15-broker cluster on Windows machines, and one of the brokers, 10, is the only ISR member on all partitions that it is the leader of. On partitions where it isn't the leader, it seems to follow the leader fine.
This is an excerpt from 'describe':

Topic: ClientQosCombined Partition: 458 Leader: 10 Replicas: 10,6,7,8,9,0,1 Isr: 10
Topic: ClientQosCombined Partition: 459 Leader: 11 Replicas: 11,7,8,9,0,1,10 Isr: 0,10,1,9,7,11,8

The server.log files all seem to be pretty standard, and the only indication of this issue is the following pattern, which repeats often:

2017-11-06 20:28:25,207 [INFO] kafka.cluster.Partition [kafka-request-handler-8:] - Partition [ClientQosCombined,398] on broker 10: Expanding ISR for partition [ClientQosCombined,398] from 10 to 5,10
2017-11-06 20:28:39,382 [INFO] kafka.cluster.Partition [kafka-scheduler-1:] - Partition [ClientQosCombined,398] on broker 10: Shrinking ISR for partition [ClientQosCombined,398] from 5,10 to 10

This happens for each of the partitions that 10 leads. This is the only topic that we currently have in our cluster. The __consumer_offsets topic seems completely normal in terms of ISR counts.

The controller is broker 5, which is cycling through attempting and failing to trigger leader elections on the partitions led by broker 10.
From the controller log in broker 5:

2017-11-06 20:45:04,857 [INFO] kafka.controller.KafkaController [kafka-scheduler-0:] - [Controller 5]: Starting preferred replica leader election for partitions [ClientQosCombined,375]
2017-11-06 20:45:04,857 [INFO] kafka.controller.PartitionStateMachine [kafka-scheduler-0:] - [Partition state machine on Controller 5]: Invoking state change to OnlinePartition for partitions [ClientQosCombined,375]
2017-11-06 20:45:04,857 [INFO] kafka.controller.PreferredReplicaPartitionLeaderSelector [kafka-scheduler-0:] - [PreferredReplicaPartitionLeaderSelector]: Current leader 10 for partition [ClientQosCombined,375] is not the preferred replica. Trigerring preferred replica leader election
2017-11-06 20:45:04,857 [WARN] kafka.controller.KafkaController [kafka-scheduler-0:] - [Controller 5]: Partition [ClientQosCombined,375] failed to complete preferred replica leader election. Leader is 10

I've also attached the logs and output from broker 10. Any idea what's wrong here?

  was:
We're running a 15-broker cluster on Windows machines, and one of the brokers, 10, is the only ISR member on all partitions that it is the leader of. On partitions where it isn't the leader, it seems to follow the leader fine.

This is an excerpt from 'describe':

{{
Topic: ClientQosCombined Partition: 458 Leader: 10 Replicas: 10,6,7,8,9,0,1 Isr: 10
Topic: ClientQosCombined Partition: 459 Leader: 11 Replicas: 11,7,8,9,0,1,10 Isr: 0,10,1,9,7,11,8
}}

The server.log files all seem to be pretty standard, and the only indication of this issue is the following pattern, which repeats often:

{{2017-11-06 20:28:25,207 [INFO] kafka.cluster.Partition [kafka-request-handler-8:] - Partition [ClientQosCombined,398] on broker 10: Expanding ISR for partition [ClientQosCombined,398] from 10 to 5,10
2017-11-06 20:28:39,382 [INFO] kafka.cluster.Partition [kafka-scheduler-1:] - Partition [ClientQosCombined,398] on broker 10: Shrinking ISR for partition [ClientQosCombined,398] from 5,10 to 10}}

This happens for each of the partitions that 10 leads. This is the only topic that we currently have in our cluster. The __consumer_offsets topic seems completely normal in terms of ISR counts.

The controller is broker 5, which is cycling through attempting and failing to trigger leader elections on the partitions led by broker 10.
From the controller log in broker 5:

{{2017-11-06 20:45:04,857 [INFO] kafka.controller.KafkaController [kafka-scheduler-0:] - [Controller 5]: Starting preferred replica leader election for partitions [ClientQosCombined,375]
2017-11-06 20:45:04,857 [INFO] kafka.controller.PartitionStateMachine [kafka-scheduler-0:] - [Partition state machine on Controller 5]: Invoking state change to OnlinePartition for partitions [ClientQosCombined,375]
2017-11-06 20:45:04,857 [INFO] kafka.controller.PreferredReplicaPartitionLeaderSelector [kafka-scheduler-0:] - [PreferredReplicaPartitionLeaderSelector]: Current leader 10 for partition [ClientQosCombined,375] is not the preferred replica. Trigerring preferred replica leader election
2017-11-06 20:45:04,857 [WARN] kafka.controller.KafkaController [kafka-scheduler-0:] - [Controller 5]: Partition [ClientQosCombined,375] failed to complete preferred replica leader election. Leader is 10}}

I've also attached the logs and output from broker 10. Any idea what's wrong here?


> Broker is listed as only ISR for all partitions it is leader of
> ---------------------------------------------------------------
>
>                 Key: KAFKA-6178
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6178
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.1.0
>         Environment: Windows
>            Reporter: AS
>              Labels: windows
>         Attachments: KafkaServiceOutput.txt, log-cleaner.log, server.log
>
>
> We're running a 15-broker cluster on Windows machines, and one of the brokers, 10, is the only ISR member on all partitions that it is the leader of. On partitions where it isn't the leader, it seems to follow the leader fine.
> This is an excerpt from 'describe':
> Topic: ClientQosCombined Partition: 458 Leader: 10 Replicas: 10,6,7,8,9,0,1 Isr: 10
> Topic: ClientQosCombined Partition: 459 Leader: 11 Replicas: 11,7,8,9,0,1,10 Isr: 0,10,1,9,7,11,8
> The server.log files all seem to be pretty standard, and the only indication of this issue is the following pattern, which repeats often:
> 2017-11-06 20:28:25,207 [INFO] kafka.cluster.Partition [kafka-request-handler-8:] - Partition [ClientQosCombined,398] on broker 10: Expanding ISR for partition [ClientQosCombined,398] from 10 to 5,10
> 2017-11-06 20:28:39,382 [INFO] kafka.cluster.Partition [kafka-scheduler-1:] - Partition [ClientQosCombined,398] on broker 10: Shrinking ISR for partition [ClientQosCombined,398] from 5,10 to 10
> This happens for each of the partitions that 10 leads. This is the only topic that we currently have in our cluster. The __consumer_offsets topic seems completely normal in terms of ISR counts.
> The controller is broker 5, which is cycling through attempting and failing to trigger leader elections on the partitions led by broker 10.
> From the controller log in broker 5:
> 2017-11-06 20:45:04,857 [INFO] kafka.controller.KafkaController [kafka-scheduler-0:] - [Controller 5]: Starting preferred replica leader election for partitions [ClientQosCombined,375]
> 2017-11-06 20:45:04,857 [INFO] kafka.controller.PartitionStateMachine [kafka-scheduler-0:] - [Partition state machine on Controller 5]: Invoking state change to OnlinePartition for partitions [ClientQosCombined,375]
> 2017-11-06 20:45:04,857 [INFO] kafka.controller.PreferredReplicaPartitionLeaderSelector [kafka-scheduler-0:] - [PreferredReplicaPartitionLeaderSelector]: Current leader 10 for partition [ClientQosCombined,375] is not the preferred replica. Trigerring preferred replica leader election
> 2017-11-06 20:45:04,857 [WARN] kafka.controller.KafkaController [kafka-scheduler-0:] - [Controller 5]: Partition [ClientQosCombined,375] failed to complete preferred replica leader election. Leader is 10
> I've also attached the logs and output from broker 10. Any idea what's wrong here?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
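
For anyone wanting to check how widespread the symptom is, here is a minimal sketch (not part of the original report) that scans 'describe' output and groups, by leader, the partitions whose ISR contains only the leader. The sample text is the two lines quoted above; the function names parse_describe and leader_only_isr are hypothetical helpers, and the regular expression assumes the "Topic: / Partition: / Leader: / Replicas: / Isr:" field layout shown in the excerpt.

```python
import re
from collections import defaultdict

# Sample lines copied from the 'describe' excerpt in the report.
DESCRIBE_OUTPUT = """\
Topic: ClientQosCombined Partition: 458 Leader: 10 Replicas: 10,6,7,8,9,0,1 Isr: 10
Topic: ClientQosCombined Partition: 459 Leader: 11 Replicas: 11,7,8,9,0,1,10 Isr: 0,10,1,9,7,11,8
"""

# Assumed field layout, matching the excerpt; \s also tolerates tab separators.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_describe(text):
    """Yield (topic, partition, leader, replicas, isr) for each partition line."""
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # skip summary lines and anything unrecognised
        replicas = [int(x) for x in m["replicas"].split(",") if x]
        isr = [int(x) for x in m["isr"].split(",") if x]
        yield m["topic"], int(m["partition"]), int(m["leader"]), replicas, isr


def leader_only_isr(text):
    """Group partitions whose ISR is exactly [leader], keyed by leader id."""
    result = defaultdict(list)
    for topic, partition, leader, replicas, isr in parse_describe(text):
        # Only flag partitions that are actually replicated.
        if len(replicas) > 1 and isr == [leader]:
            result[leader].append((topic, partition))
    return dict(result)


if __name__ == "__main__":
    for leader, parts in sorted(leader_only_isr(DESCRIBE_OUTPUT).items()):
        print(f"broker {leader}: {len(parts)} partition(s) with leader-only ISR: {parts}")
```

Run against the full 'describe' output for the topic, a result with a single broker id as the only key would match the behaviour described in this issue, where only the partitions led by broker 10 are affected.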