kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matthias J. Sax (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-4509) Task reusage on rebalance fails for threads on same host
Date Wed, 07 Dec 2016 21:08:58 GMT
Matthias J. Sax created KAFKA-4509:

             Summary: Task reusage on rebalance fails for threads on same host
                 Key: KAFKA-4509
                 URL: https://issues.apache.org/jira/browse/KAFKA-4509
             Project: Kafka
          Issue Type: Bug
          Components: streams
            Reporter: Matthias J. Sax
            Assignee: Matthias J. Sax

In https://issues.apache.org/jira/browse/KAFKA-3559 task reusage on rebalance was introduces
as a performance optimization. Instead of closing a task on rebalance (ie, {{onPartitionsRevoked()}},
it only get's suspended for a potential reuse in {{onPartitionsAssigned()}}. Only if a task
cannot be reused, it will eventually get closed in {{onPartitionsAssigned()}}.

This mechanism can fail, if multiple {{StreamThreads}} run in the same host (same or different
JVM). The scenario is as follows:

 - assume 2 running threads A and B
 - assume 3 tasks t1, t2, t3
 - assignment: A-(t1,t2) and B-(t3)
 - on the same host, a new single threaded Stream application (same app-id) gets started (thread
 - on rebalance, t2 (could also be t1 -- does not matter) will be moved from A to C
 - as assignment is only sticky base on an heurictic t1 can sometimes be assigned to B, too
-- and t3 get's assigned to A (thre is a race condition if this "task flipping" happens or
 - on revoke, A will suspend task t1 and t2 (not releasing any locks)
 - on assign
    - A tries to create t3 but as B did not release it yet, A dies with an "cannot get lock"
    - B tries to create t1 but as A did not release it yet, B dies with an "cannot get lock"
    - as A and B trie to create the task first, this will always fail if task flipping happened
   - C tries to create t2 but A did not release t2 lock yet (race condition) and C dies with
an exception (this could even happen without "task flipping" between A and B)

We want to fix this, by:
  # first release unassigned suspended tasks in {{onPartitionsAssignment()}}, and afterward
create new tasks (this fixes the "task flipping" issue)
  # use a "backoff and retry mechanism" if a task cannot be created (to handle release-create
race condition between different threads)

This message was sent by Atlassian JIRA

View raw message