Date: Thu, 26 May 2016 17:36:12 +0000 (UTC)
From: "Greg Fodor (JIRA)"
To: dev@kafka.apache.org
Reply-To: dev@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-3758) KStream job fails to recover after Kafka broker stopped

    [ https://issues.apache.org/jira/browse/KAFKA-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302542#comment-15302542 ]

Greg Fodor commented on KAFKA-3758:
-----------------------------------

Also, the log is truncated at the top to the point where we shut the broker off.
If there's additional useful information in the log before that point that you think we should share, I'm happy to attach it.

> KStream job fails to recover after Kafka broker stopped
> -------------------------------------------------------
>
>                 Key: KAFKA-3758
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3758
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.0.0
>            Reporter: Greg Fodor
>            Assignee: Guozhang Wang
>         Attachments: muon.log.1.gz
>
>
> We've been doing some testing of a fairly complex KStreams job, and under load it seems the job fails to rebalance and recover if we shut down one of the Kafka brokers. The test we were running had a 3-node Kafka cluster where each topic had a replication factor of at least 2, and we terminated one of the nodes.
> Attached is the full log; the root exception seems to be contention on the lock on the state directory. The job continues to try to recover but throws lock-related errors over and over. Restarting the job itself resolves the problem.
> org.apache.kafka.streams.errors.ProcessorStateException: Error while creating the state manager
>     at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:71)
>     at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:86)
>     at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:550)
>     at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:577)
>     at org.apache.kafka.streams.processor.internals.StreamThread.access$000(StreamThread.java:68)
>     at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:123)
>     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:222)
>     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$1.onSuccess(AbstractCoordinator.java:232)
>     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$1.onSuccess(AbstractCoordinator.java:227)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture$2.onSuccess(RequestFuture.java:182)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
>     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:436)
>     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:422)
>     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:679)
>     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:658)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
>     at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
>     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:426)
>     at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:278)
>     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
>     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
>     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
>     at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
>     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:243)
>     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:345)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:977)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937)
>     at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:295)
>     at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:218)
> Caused by: java.io.IOException: Failed to lock the state directory: /muon/state/job-stream_photon_messages-1/2_82
>     at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:95)
>     at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:69)
>     ... 32 more

-- 
This message was sent by Atlassian JIRA (v6.3.4#6332)
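
For readers unfamiliar with directory lock files, the following is a minimal Java sketch of how a second attempt to lock the same state directory can fail until the first holder releases it. It may help when reading the "Failed to lock the state directory" cause in the trace above. It is an illustration only and is not taken from the Kafka Streams source: the class name StateDirLockSketch, the helpers tryLockDir and release, and the ".lock" file name are all hypothetical, and whether this is the mechanism behind KAFKA-3758 is not established here.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;

// Hypothetical sketch; names and lock-file layout are assumptions, not Kafka's code.
class StateDirLockSketch {

    // Tries to take an exclusive lock on a ".lock" file inside dir.
    // Returns the lock, or null if someone else already holds it.
    static FileLock tryLockDir(File dir) throws IOException {
        File lockFile = new File(dir, ".lock");
        FileChannel channel = new RandomAccessFile(lockFile, "rw").getChannel();
        try {
            // tryLock() returns null when another process holds the lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                channel.close();
            }
            return lock;
        } catch (OverlappingFileLockException e) {
            // Thrown when this same JVM already holds an overlapping lock.
            // Note: on some platforms, closing any channel on the file can also
            // drop OS-level locks held through other channels, so production code
            // should prefer a single channel per lock file.
            channel.close();
            return null;
        }
    }

    // Releases the lock and closes the channel it was acquired on.
    static void release(FileLock lock) throws IOException {
        lock.release();
        lock.channel().close();
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("state-dir-sketch").toFile();

        FileLock first = tryLockDir(dir);   // a task takes the directory
        FileLock second = tryLockDir(dir);  // a new task tries before the old one let go
        System.out.println("first acquired:  " + (first != null));
        System.out.println("second acquired: " + (second != null));

        release(first);                     // the old holder finally releases
        FileLock third = tryLockDir(dir);
        System.out.println("after release:   " + (third != null));
        release(third);
    }
}
```

In this sketch the second attempt fails for as long as the first lock is held, and a later attempt succeeds once it is released. A task that retries without the earlier holder ever releasing would keep failing in the same way, which resembles the repeated lock errors described in the report.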