kafka-jira mailing list archives

From Björn Eriksson (JIRA) <j...@apache.org>
Subject [jira] [Commented] (KAFKA-5546) Temporary loss of availability data when the leader is disconnected
Date Sun, 06 Aug 2017 09:56:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115735#comment-16115735 ]

Björn Eriksson commented on KAFKA-5546:

Hi Jason,

No, {{ifdown}} means that the connection won't be shut down cleanly. We're building a
fault-tolerant system and we need to test network failures, such as a hardware failure or a
disconnected network cable.

I've updated the branch to include results for {{ifdown}} and {{kill -9}} (docker rm).
Testing with {{kill -9}} shows better results (2-8 seconds), but we'd like guarantees
much lower than that.

The {{ifdown}} test shows that after the _1003_ leader is disconnected (_@11:31:12_) it takes
~2.5 seconds for the producer to realise this and report _Disconnecting from node 1003 due
to request timeout_. Zookeeper reports the new leader to be _1002_ after ~6 seconds, but the
producer doesn't learn of the new leader until 14 seconds after the network failure, even
though it sends metadata requests continuously.
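
For anyone who wants to measure the unavailable window from their own run, here is a minimal sketch. It assumes the timestamps of the records that actually reached the log have already been collected into a list; the function name {{find_gaps}}, the one-second threshold and the sample timestamps are all hypothetical and are not taken from the test repository.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(seconds=1)):
    """Return (start, end, duration) for every pause longer than threshold.

    timestamps: datetimes of records that were successfully stored.
    threshold:  hypothetical cut-off; any pause above it counts as a gap.
    """
    ordered = sorted(timestamps)
    gaps = []
    # Compare each record with the one stored immediately after it.
    for previous, current in zip(ordered, ordered[1:]):
        pause = current - previous
        if pause > threshold:
            gaps.append((previous, current, pause))
    return gaps


# Hypothetical sample: records every 0.5 s, then nothing for 14 s.
base = datetime(2017, 1, 1, 11, 31, 10)
sample = [base + timedelta(seconds=s) for s in (0, 0.5, 1.0, 1.5, 2.0, 16.0, 16.5)]

for start, end, pause in find_gaps(sample):
    print(f"{start.time()} -> {end.time()}: {pause.total_seconds():.1f}s without records")
```

Running the same function over the {{ifdown}} and {{kill -9}} results would give directly comparable gap lengths for the two failure modes.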

> Temporary loss of availability data when the leader is disconnected
> -------------------------------------------------------------------
>                 Key: KAFKA-5546
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5546
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>    Affects Versions:,
>         Environment: docker, failing-network
>            Reporter: Björn Eriksson
> We've noticed that if the leader's networking is deconfigured (with {{ifconfig eth0 down}}),
> the producer won't notice this and doesn't immediately connect to the newly elected leader.
> {{docker-compose.yml}} and test runner are at https://github.com/owbear/kafka-network-failure-tests.
> We were expecting a transparent failover to the new leader, but testing shows that there's
> an 8-15 second gap during which no values are stored in the log after the network is taken down.
> Tests (and results) [against|https://github.com/owbear/kafka-network-failure-tests/tree/kafka-network-failure-tests-]
