kafka-jira mailing list archives

From "Geoffrey Stewart (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-5016) Consumer hang in poll method while rebalancing is in progress
Date Fri, 14 Jul 2017 23:34:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088285#comment-16088285 ]

Geoffrey Stewart commented on KAFKA-5016:
-----------------------------------------

I have also encountered the issue documented in this Jira using 0.10.2.0 brokers with the
0.10.2.0 client.  The issue only occurs when we use the "subscribe" call from the API, which
dynamically assigns partitions.  When we use the "assign" call from the API to manually assign
lists of partitions, we do not have any issue.  I don't think what is described above
represents the expected behavior of dynamic partition assignment and consumer group coordination.
Based on the above explanation, it sounds like it would not be possible to have 2 or more
simultaneous consumer instances in the same consumer group when using dynamic partition assignment
(subscribe).  For example, there could be one consumer instance in the group which has made
some calls to "poll".  As soon as a second consumer instance comes along, its call to "poll"
is only processed after max.poll.interval.ms has elapsed since the first consumer's most recent
poll request - at this time the broker no longer considers the first consumer to be
part of the group.  I certainly agree that with the arrival of the second consumer in the
group, the broker must perform a rebalance or restabilization, which may take some time.  However,
this should not take max.poll.interval.ms, since the liveness of the first consumer should
be maintained by its heartbeat, which occurs every heartbeat.interval.ms.  I have confirmed
that with the default value of 300000 for the property max.poll.interval.ms, the group
restabilization (rebalance) takes about this long (5 minutes), and only then is the second consumer
instance's poll request processed.  Lowering this value to 30000 has the effect of reducing the group
restabilization (rebalance) to about 30 seconds before the second consumer instance's poll
request is processed.
To summarize, please explain how I can establish parallel consumer instances in the same group
using the subscribe method from the API, which dynamically assigns partitions.  Further, please
help me to understand why the consumer instance's heartbeat does not seem to be maintaining
its liveness.
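
For reference, the two settings discussed above can be written down as a minimal sketch of the
consumer configuration.  This is not the configuration from the attached test case: the class
and method names, the bootstrap address, and the group id are hypothetical placeholders.  The
value 30000 is the lowered max.poll.interval.ms mentioned above (the default is 300000), and
3000 is the documented default for heartbeat.interval.ms.

```java
import java.util.Properties;

public class ConsumerTimeoutSketch {

    // Builds consumer settings with a caller-chosen max.poll.interval.ms.
    // Only java.util.Properties is used, so this compiles without the Kafka client.
    public static Properties consumerProps(int maxPollIntervalMs) {
        Properties props = new Properties();
        // Hypothetical placeholders, not taken from the issue:
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "example-group");
        // Upper bound on the time between two poll() calls before the
        // consumer is considered failed and removed from the group.
        props.setProperty("max.poll.interval.ms", Integer.toString(maxPollIntervalMs));
        // Interval between heartbeats to the group coordinator (documented default).
        props.setProperty("heartbeat.interval.ms", "3000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProps(30000);
        System.out.println("max.poll.interval.ms = " + props.getProperty("max.poll.interval.ms"));
        System.out.println("heartbeat.interval.ms = " + props.getProperty("heartbeat.interval.ms"));
    }
}
```

In a real client these properties, together with key.deserializer and value.deserializer, would
be passed to the KafkaConsumer constructor before calling "subscribe".  The sketch only records
the settings; it does not reproduce the hang.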

> Consumer hang in poll method while rebalancing is in progress
> -------------------------------------------------------------
>
>                 Key: KAFKA-5016
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5016
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0, 0.10.2.0
>            Reporter: Domenico Di Giulio
>            Assignee: Vahid Hashemian
>         Attachments: Kafka 0.10.2.0 Issue (TRACE) - Server + Client.txt, Kafka 0.10.2.0 Issue (TRACE).txt, KAFKA_5016.java
>
>
> After moving to Kafka 0.10.2.0, it looks like I'm experiencing a hang in the rebalancing code.
> This is a test case, not (yet) production code. It does the following with a single-partition topic and two consumers in the same group:
> 1) a topic with one partition is forced to be created (auto-created)
> 2) a producer is used to write 10 messages
> 3) the first consumer reads all the messages and commits
> 4) the second consumer attempts a poll() and hangs indefinitely
> The same issue can't be found with 0.10.0.0.
> See the attached logs at TRACE level. Look for "SERVER HANGS" to see where the hang is found: when this happens, the client keeps failing any heartbeat attempt, as the rebalancing is in progress, and the poll method hangs indefinitely.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
