Return-Path: X-Original-To: apmail-kafka-dev-archive@www.apache.org Delivered-To: apmail-kafka-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B427BE0DC for ; Tue, 15 Jan 2013 18:14:12 +0000 (UTC) Received: (qmail 26845 invoked by uid 500); 15 Jan 2013 18:14:12 -0000 Delivered-To: apmail-kafka-dev-archive@kafka.apache.org Received: (qmail 26804 invoked by uid 500); 15 Jan 2013 18:14:12 -0000 Mailing-List: contact dev-help@kafka.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@kafka.apache.org Delivered-To: mailing list dev@kafka.apache.org Received: (qmail 26794 invoked by uid 99); 15 Jan 2013 18:14:12 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 15 Jan 2013 18:14:12 +0000 Date: Tue, 15 Jan 2013 18:14:12 +0000 (UTC) From: "Jun Rao (JIRA)" To: dev@kafka.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (KAFKA-693) Consumer rebalance fails if no leader available for a partition and stops all fetchers MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/KAFKA-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554068#comment-13554068 ] Jun Rao commented on KAFKA-693: ------------------------------- Thanks for the patch. Some comments: 10. ZookeeperConsumerConnector: Let's define a constant InvalidOffset, instead of using -1 directly. 11. ConsumerFetcherManager.doWork(): After we identify the leader of a partition, the leader could change immediately. So, we may hit the exception when calling addFetcher(). When this happens, we haven't added the partition to the fetcher and we don't want to lose it. So, we should add it back to noLeaderPartitionSet so that we can find the new leader later. 12. ReplicaFetcherThread: Yes, it should also throw an exception if getOffsetBefore returns an error. 13. AbstractFetcherThread.doWork(): We need to handle the exception when calling handleOffsetOutOfRange(). If we get an exception, we should add the partition to partitionsWithError. This will cover both ConsumerFetcherThread and ReplicaFetcherThread. > Consumer rebalance fails if no leader available for a partition and stops all fetchers > -------------------------------------------------------------------------------------- > > Key: KAFKA-693 > URL: https://issues.apache.org/jira/browse/KAFKA-693 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 0.8 > Reporter: Maxime Brugidou > Assignee: Maxime Brugidou > Attachments: KAFKA-693.patch, mirror_debug.log, mirror.log > > > I am currently experiencing this with the MirrorMaker but I assume it happens for any rebalance. The symptoms are: > I have replication factor of 1 > 1. If i start the MirrorMaker (bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config mirror-consumer.properties --producer.config mirror-producer.properties --blacklist 'xdummyx' --num.streams=1 --num.producers=1) with a broker down > 1.1 I set the refresh.leader.backoff.ms to 600000 (10min) so that the ConsumerFetcherManager doesn't retry to often to get the unavailable partitions > 1.2 The rebalance starts at the init step and fails: Exception in thread "main" kafka.common.ConsumerRebalanceFailedException: KafkaMirror_mirror-01-1357893495345-fac86b15 can't rebalance after 4 retries > 1.3 After the exception, everything stops (fetchers and queues) > 1.4 I attached the full logs (info & debug) for this case > 2. If i start the MirrorMaker with all the brokers up and then kill a broker > 2.1 The first rebalance is successful > 2.2 The consumer will handle correctly the broker down and stop the associated ConsumerFetcherThread > 2.3 The refresh.leader.backoff.ms to 600000 works correctly > 2.4 If something triggers a rebalance (new topic, partition reassignment...), then we go back to 1., the rebalance fails and stops everything. > I think the desired behavior is to consumer whatever is available, and try later at some intervals. I would be glad to help on that issue although the Consumer code seems a little tough to get on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira