Date: Mon, 4 Mar 2013 18:21:13 +0000 (UTC)
From: "Chris Curtin (JIRA)"
To: dev@kafka.apache.org
Subject: [jira] [Created] (KAFKA-783) Preferred replica assignment on leader failure may not be correct

Chris Curtin created KAFKA-783:
----------------------------------

             Summary: Preferred replica assignment on leader failure may not be correct
                 Key: KAFKA-783
                 URL: https://issues.apache.org/jira/browse/KAFKA-783
             Project: Kafka
          Issue Type: Bug
          Components: replication
    Affects Versions: 0.8
         Environment: $ uname -a
Linux vrd01.atlnp1 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

$ java -version
java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)

Kafka 0.8.0 loaded from HEAD on 1/29/2013
            Reporter: Chris Curtin
            Assignee: Neha Narkhede


Based on an email thread in the
user group, Neha asked me to submit this.

Original question:

"I ran another test, again starting with a full cluster and all partitions
had a full set of copies. When I stop the broker which was leader for 9 of
the 10 partitions, the leaders were all elected on one machine instead of
the set of 3. Should the leaders have been better spread out? Also the
copies weren't fully populated either."

Neha:

"For problem 2, we always try to make the preferred replica (1st replica
in the list of all replicas for a partition) the leader, if it is
available. We intended to spread the preferred replica for all partitions
for a topic evenly across the brokers. If this is not happening, we need
to look into it. Please can you file a bug and describe your test case
there?"

Configuration:

4 node cluster
1 topic with 3 replicas
10 partitions: 0-9 below

Current status:

Partition: 0:vrd01.atlnp1 R:[ vrd03.atlnp1 vrd04.atlnp1 vrd01.atlnp1] I:[ vrd01.atlnp1 vrd03.atlnp1 vrd04.atlnp1]
Partition: 1:vrd01.atlnp1 R:[ vrd04.atlnp1 vrd01.atlnp1 vrd02.atlnp1] I:[ vrd01.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
Partition: 2:vrd01.atlnp1 R:[ vrd01.atlnp1 vrd02.atlnp1 vrd03.atlnp1] I:[ vrd01.atlnp1 vrd03.atlnp1 vrd02.atlnp1]
Partition: 3:vrd03.atlnp1 R:[ vrd02.atlnp1 vrd03.atlnp1 vrd04.atlnp1] I:[ vrd03.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
Partition: 4:vrd01.atlnp1 R:[ vrd03.atlnp1 vrd01.atlnp1 vrd02.atlnp1] I:[ vrd01.atlnp1 vrd03.atlnp1 vrd02.atlnp1]
Partition: 5:vrd03.atlnp1 R:[ vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1] I:[ vrd03.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
Partition: 6:vrd01.atlnp1 R:[ vrd01.atlnp1 vrd03.atlnp1 vrd04.atlnp1] I:[ vrd01.atlnp1 vrd03.atlnp1 vrd04.atlnp1]
Partition: 7:vrd01.atlnp1 R:[ vrd02.atlnp1 vrd04.atlnp1 vrd01.atlnp1] I:[ vrd01.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
Partition: 8:vrd03.atlnp1 R:[ vrd03.atlnp1 vrd02.atlnp1 vrd04.atlnp1] I:[ vrd03.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
Partition: 9:vrd01.atlnp1 R:[ vrd04.atlnp1 vrd03.atlnp1 vrd01.atlnp1] I:[ vrd01.atlnp1 vrd03.atlnp1 vrd04.atlnp1]

Shutdown vrd03:

 Partition: 0:vrd01.atlnp1 R:[ ] I:[]
 Partition: 1:vrd01.atlnp1 R:[ vrd04.atlnp1 vrd01.atlnp1 vrd02.atlnp1] I:[ vrd01.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
 Partition: 2:vrd01.atlnp1 R:[ ] I:[]
*Partition: 3:vrd04.atlnp1 R:[ ] I:[]
 Partition: 4:vrd01.atlnp1 R:[ ] I:[]
*Partition: 5:vrd04.atlnp1 R:[ ] I:[]
 Partition: 6:vrd01.atlnp1 R:[ ] I:[]
 Partition: 7:vrd01.atlnp1 R:[ vrd02.atlnp1 vrd04.atlnp1 vrd01.atlnp1] I:[ vrd01.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
*Partition: 8:vrd04.atlnp1 R:[ ] I:[]
 Partition: 9:vrd01.atlnp1 R:[ ] I:[]

(* means the leader changed)

Note that partitions 3, 5 and 8 were assigned new leaders. Per an email
group thread with Neha, the new leader should be assigned from the
preferred replica, so 3 should have gotten vrd02, 5 vrd04 and 8 vrd02
(since vrd03 was shut down). Instead 3 got vrd04, 5 got vrd04 and 8 got
vrd04.

Restarting vrd03 led to:

Partition: 0:vrd01.atlnp1 R:[ vrd03.atlnp1 vrd04.atlnp1 vrd01.atlnp1] I:[ vrd01.atlnp1 vrd04.atlnp1 vrd03.atlnp1]
Partition: 1:vrd01.atlnp1 R:[ vrd04.atlnp1 vrd01.atlnp1 vrd02.atlnp1] I:[ vrd01.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
Partition: 2:vrd01.atlnp1 R:[ vrd01.atlnp1 vrd02.atlnp1 vrd03.atlnp1] I:[ vrd01.atlnp1 vrd02.atlnp1 vrd03.atlnp1]
Partition: 3:vrd04.atlnp1 R:[ vrd02.atlnp1 vrd03.atlnp1 vrd04.atlnp1] I:[ vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1]
Partition: 4:vrd01.atlnp1 R:[ vrd03.atlnp1 vrd01.atlnp1 vrd02.atlnp1] I:[ vrd01.atlnp1 vrd02.atlnp1 vrd03.atlnp1]
Partition: 5:vrd04.atlnp1 R:[ vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1] I:[ vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1]
Partition: 6:vrd01.atlnp1 R:[ vrd01.atlnp1 vrd03.atlnp1 vrd04.atlnp1] I:[ vrd01.atlnp1 vrd04.atlnp1 vrd03.atlnp1]
Partition: 7:vrd01.atlnp1 R:[ vrd02.atlnp1 vrd04.atlnp1 vrd01.atlnp1] I:[ vrd01.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
Partition: 8:vrd04.atlnp1 R:[ vrd03.atlnp1 vrd02.atlnp1 vrd04.atlnp1] I:[ vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1]
Partition: 9:vrd01.atlnp1 R:[ vrd04.atlnp1 vrd03.atlnp1 vrd01.atlnp1] I:[ vrd01.atlnp1 vrd04.atlnp1 vrd03.atlnp1]

Stopping vrd01 now led to:

*Partition: 0:vrd04.atlnp1 R:[ ] I:[]
*Partition: 1:vrd04.atlnp1 R:[ ] I:[]
*Partition: 2:vrd02.atlnp1 R:[ ] I:[]
 Partition: 3:vrd04.atlnp1 R:[ vrd02.atlnp1 vrd03.atlnp1 vrd04.atlnp1] I:[ vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1]
*Partition: 4:vrd02.atlnp1 R:[ ] I:[]
 Partition: 5:vrd04.atlnp1 R:[ vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1] I:[ vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1]
*Partition: 6:vrd04.atlnp1 R:[ ] I:[]
*Partition: 7:vrd04.atlnp1 R:[ ] I:[]
 Partition: 8:vrd04.atlnp1 R:[ vrd03.atlnp1 vrd02.atlnp1 vrd04.atlnp1] I:[ vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1]
*Partition: 9:vrd04.atlnp1 R:[ ] I:[]

(* means the leader changed)

So 0, 2, 4, 6 and 7 were assigned the wrong leader (if the preferred
replica is the first one in the list; if it is the last one, then 1 and 2
are wrong).

Java code used to dump the partition metadata:

    kafka.javaapi.consumer.SimpleConsumer consumer = new SimpleConsumer("vrd04.atlnp1", 9092, 100000, 64 * 1024, "test");

    List<String> topics2 = new ArrayList<String>();
    topics2.add("storm-anon");
    TopicMetadataRequest req = new TopicMetadataRequest(topics2);
    kafka.javaapi.TopicMetadataResponse resp = consumer.send(req);

    List<kafka.javaapi.TopicMetadata> data3 = resp.topicsMetadata();
    for (kafka.javaapi.TopicMetadata item : data3) {
        for (kafka.javaapi.PartitionMetadata part : item.partitionsMetadata()) {
            String replicas = "";
            String isr = "";
            for (kafka.cluster.Broker replica : part.replicas()) {
                replicas += " " + replica.host();
            }
            for (kafka.cluster.Broker replica : part.isr()) {
                isr += " " + replica.host();
            }
            System.out.println("Partition: " + part.partitionId() + ":" + part.leader().host()
                    + " R:[" + replicas + "] I:[" + isr + "]");
        }
    }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
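The expected behavior Neha describes (make the preferred replica, i.e. the first entry in the partition's replica list, the leader whenever it is available) can be sketched as follows. This is a minimal illustration only, not Kafka's actual controller code; the `chooseLeader` helper and the broker names are taken from the report above for the vrd03-shutdown case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PreferredLeaderSketch {

    // Illustrative helper: return the first replica in the assignment list
    // that is still alive. Under the preferred-replica rule this is the
    // broker that should become the new leader on failover.
    static String chooseLeader(List<String> replicas, Set<String> liveBrokers) {
        for (String r : replicas) {
            if (liveBrokers.contains(r)) {
                return r;
            }
        }
        return null; // no live replica: the partition would be offline
    }

    public static void main(String[] args) {
        // vrd03 has been shut down.
        Set<String> live = new HashSet<String>(Arrays.asList(
                "vrd01.atlnp1", "vrd02.atlnp1", "vrd04.atlnp1"));

        // Partition 3's replica list from the "Current status" listing above.
        List<String> p3 = Arrays.asList("vrd02.atlnp1", "vrd03.atlnp1", "vrd04.atlnp1");

        System.out.println("Expected leader for partition 3: " + chooseLeader(p3, live));
        // Prints "Expected leader for partition 3: vrd02.atlnp1" -
        // but the report shows vrd04.atlnp1 was actually elected.
    }
}
```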