kafka-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4228) Sender thread death leaves KafkaProducer in a bad state
Date Thu, 29 Sep 2016 00:28:20 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15531356#comment-15531356 ]

ASF GitHub Bot commented on KAFKA-4228:

GitHub user radai-rosenblatt opened a pull request:


    KAFKA-4228 - make producer close on sender thread death, make consumer shutdown on failure
to rebalance, and make MM die on any of the above.

    the JIRA issue (https://issues.apache.org/jira/browse/KAFKA-4228) details a cascade of
failures that resulted in an entire mirror maker cluster stalling due to an OOM death on one
mm instance.
    this patch makes producers and consumers close themselves when they hit these errors, and
makes mm shut down if anything happens to its producers or consumers.
    Signed-off-by: radai-rosenblatt <radai.rosenblatt@gmail.com>

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/radai-rosenblatt/kafka honorable-death

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1930

commit efca6e8dfa65acb5fbb1bc1801fda151b08f4f81
Author: radai-rosenblatt <radai.rosenblatt@gmail.com>
Date:   2016-09-29T00:16:16Z

    KAFKA-4228 - make producer close on sender thread death, make consumer shutdown on failure
to rebalance, and make MM die on any of the above.
    Signed-off-by: radai-rosenblatt <radai.rosenblatt@gmail.com>


> Sender thread death leaves KafkaProducer in a bad state
> -------------------------------------------------------
>                 Key: KAFKA-4228
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4228
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions:
>            Reporter: radai rosenblatt
>
> a KafkaProducer's Sender thread may die:
> {noformat}
> 2016/09/28 00:28:01.065 ERROR [KafkaThread] [kafka-producer-network-thread | mm_ei-lca1_uniform]
[kafka-mirror-maker] [] Uncaught exception in kafka-producer-network-thread | mm_ei-lca1_uni
> java.lang.OutOfMemoryError: Java heap space
>        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[?:1.8.0_40]
>        at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) ~[?:1.8.0_40]
>        at org.apache.kafka.common.requests.RequestSend.serialize(RequestSend.java:35)
>        at org.apache.kafka.common.requests.RequestSend.<init>(RequestSend.java:29)
>        at org.apache.kafka.clients.producer.internals.Sender.produceRequest(Sender.java:355)
>        at org.apache.kafka.clients.producer.internals.Sender.createProduceRequests(Sender.java:337)
>        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:211) ~[kafka-clients-]
>        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:134) ~[kafka-clients-]
>        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_40]
> {noformat}
> which leaves the producer in a bad state. in this state, a call to flush(), for example,
will hang indefinitely: the sender thread is no longer around to flush batches, but they have
not been aborted either.
> even worse, this can happen in MirrorMaker just before a rebalance, at which point MM
will block indefinitely during the rebalance (in beforeReleasingPartitions()).
> a rebalance participant that hangs this way causes the rebalance to fail for the rest
of the participants, at which point ZKRebalancerListener.watcherExecutorThread() dies with an
exception (cannot rebalance after X attempts) but the consumer that ran the thread remains
live. the end result is a bunch of zombie mirror makers and orphaned topic partitions.
> a dead sender thread should result in closing the producer.
> a consumer failing to rebalance should shut down.
> any issue with the producer or consumer should cause mirror-maker death.
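
For illustration only, here is a minimal sketch of the first proposal above (a dead sender thread should result in closing the producer). It is not the code in the pull request and does not use the real KafkaProducer internals: FailFastProducerSketch, send(), flush(timeoutMs), isClosed() and abortPending() are all hypothetical names, and only the failure path is modelled (nothing here completes a send successfully). The idea shown is that an uncaught-exception handler on the sender thread aborts every pending send, so a later flush() fails instead of hanging.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a producer with a background sender thread.
// This is NOT the KafkaProducer API and NOT the code in the pull request.
class FailFastProducerSketch {
    private final Queue<CompletableFuture<Void>> pending = new ConcurrentLinkedQueue<>();
    private volatile Throwable fatal; // non-null once the sender thread has died

    FailFastProducerSketch(Runnable senderLoop) {
        Thread sender = new Thread(senderLoop, "sketch-sender");
        sender.setDaemon(true);
        // If the sender dies from any uncaught Throwable (an OutOfMemoryError,
        // for example), abort everything that is still waiting on it.
        sender.setUncaughtExceptionHandler((thread, error) -> abortPending(error));
        sender.start();
    }

    private void abortPending(Throwable error) {
        fatal = error; // publish first, so later send() calls see it
        for (CompletableFuture<Void> f : pending) {
            f.completeExceptionally(error); // no-op if already completed
        }
    }

    CompletableFuture<Void> send() {
        CompletableFuture<Void> f = new CompletableFuture<>();
        pending.add(f);
        // Re-check after adding: covers a sender that died before or during the add.
        Throwable cause = fatal;
        if (cause != null) {
            f.completeExceptionally(cause);
        }
        return f;
    }

    // Waits for all pending sends, but fails once the sender is known to be dead.
    void flush(long timeoutMs) throws InterruptedException, TimeoutException {
        CompletableFuture<?>[] snapshot = pending.toArray(new CompletableFuture<?>[0]);
        try {
            CompletableFuture.allOf(snapshot).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            throw new IllegalStateException("sender thread died", e.getCause());
        }
        Throwable cause = fatal;
        if (cause != null) {
            throw new IllegalStateException("sender thread died", cause);
        }
    }

    boolean isClosed() {
        return fatal != null;
    }
}
```

A supervising process in the MirrorMaker role could, under the same assumptions, check isClosed() (or catch the IllegalStateException from flush()) and shut itself down rather than stay in a rebalance it can never finish.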
