kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maxime Brugidou (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-693) Consumer rebalance fails if no leader available for a partition and stops all fetchers
Date Mon, 14 Jan 2013 15:28:13 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552781#comment-13552781
] 

Maxime Brugidou commented on KAFKA-693:
---------------------------------------

I looked up the code in details and I am stuck because during the rebalance() operation, the
ZookeeperConsumerConnector's topicRegistry is updated with some PartitionTopicInfo that needs
to store the consumerOffset and fetchOffset. During addPartitionTopicInfo(), the consumer
offset is read from Zookeeper, however it needs to be initialized if no offsetString is available
on Zookeeper (first time starting a consumer), and we need to access the broker/leader to
get the start offset (using SimpleConsumer.earliestOrLatestOffset() in addPartitionTopicInfo()).

I digged a bit and we could probably initialize the offset later in the ConsumerFetcherManager?
I could help with a patch if i get general directions because i'm not 100% familiar with the
codebase yet.
                
> Consumer rebalance fails if no leader available for a partition and stops all fetchers
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-693
>                 URL: https://issues.apache.org/jira/browse/KAFKA-693
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8
>            Reporter: Maxime Brugidou
>         Attachments: mirror_debug.log, mirror.log
>
>
> I am currently experiencing this with the MirrorMaker but I assume it happens for any
rebalance. The symptoms are:
> I have replication factor of 1
> 1. If i start the MirrorMaker (bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config
mirror-consumer.properties  --producer.config mirror-producer.properties --blacklist 'xdummyx'
--num.streams=1 --num.producers=1) with a broker down
> 1.1 I set the refresh.leader.backoff.ms to 600000 (10min) so that the ConsumerFetcherManager
doesn't retry to often to get the unavailable partitions
> 1.2 The rebalance starts at the init step and fails: Exception in thread "main" kafka.common.ConsumerRebalanceFailedException:
KafkaMirror_mirror-01-1357893495345-fac86b15 can't rebalance after 4 retries
> 1.3 After the exception, everything stops (fetchers and queues)
> 1.4 I attached the full logs (info & debug) for this case
> 2. If i start the MirrorMaker with all the brokers up and then kill a broker
> 2.1 The first rebalance is successful
> 2.2 The consumer will handle correctly the broker down and stop the associated ConsumerFetcherThread
> 2.3 The refresh.leader.backoff.ms to 600000 works correctly
> 2.4 If something triggers a rebalance (new topic, partition reassignment...), then we
go back to 1., the rebalance fails and stops everything.
> I think the desired behavior is to consumer whatever is available, and try later at some
intervals. I would be glad to help on that issue although the Consumer code seems a little
tough to get on.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message