Date: Tue, 11 Apr 2017 20:43:41 +0000 (UTC)
From: "Eno Thereska (JIRA)"
To: dev@kafka.apache.org
Subject: [jira] [Assigned] (KAFKA-5038) running multiple kafka streams instances causes one or more instance to get into file contention
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ https://issues.apache.org/jira/browse/KAFKA-5038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eno Thereska reassigned KAFKA-5038:
-----------------------------------

    Assignee: Eno Thereska

> running multiple kafka streams instances causes one or more instance to get into file contention
> ------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5038
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5038
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.0
>         Environment: 3 Kafka broker machines
> and 3 Kafka Streams machines.
> Each machine is Linux 64-bit, CentOS 6.5, with 64 GB memory and 8 vCPUs, running in AWS.
> 31 GB of Java heap space is allocated to each KafkaStreams instance and 4 GB to each Kafka broker.
>            Reporter: Bharad Tirumala
>            Assignee: Eno Thereska
>
> Having multiple Kafka Streams application instances causes one or more instances to get into file lock contention, and the instance(s) become unresponsive with an uncaught exception.
> The exception is below:
> 22:14:37.621 [StreamThread-7] WARN o.a.k.s.p.internals.StreamThread - Unexpected state transition from RUNNING to NOT_RUNNING
> 22:14:37.621 [StreamThread-13] WARN o.a.k.s.p.internals.StreamThread - Unexpected state transition from RUNNING to NOT_RUNNING
> 22:14:37.623 [StreamThread-18] WARN o.a.k.s.p.internals.StreamThread - Unexpected state transition from RUNNING to NOT_RUNNING
> 22:14:37.625 [StreamThread-7] ERROR n.a.a.k.t.KStreamTopologyBase - Uncaught Exception:org.apache.kafka.streams.errors.ProcessorStateException: task directory [/data/kafka-streams/rtp-kstreams-metrics/0_119] doesn't exist and couldn't be created
> at org.apache.kafka.streams.processor.internals.StateDirectory.directoryForTask(StateDirectory.java:75)
> at org.apache.kafka.streams.processor.internals.StateDirectory.lock(StateDirectory.java:102)
> at org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:205)
> at org.apache.kafka.streams.processor.internals.StreamThread.maybeClean(StreamThread.java:753)
> at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:664)
> at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)
> This happens within a couple of minutes after the instances are up, while NO data is being sent to the broker yet, and the streams app is started with auto.offset.reset set to "latest".
> Please note that there are no permissions or capacity issues. This may have nothing to do with the number of instances, but I could easily reproduce it when I have 3 stream instances running. This is similar to (and may be the same bug as) [KAFKA-3758].
> Here is some relevant configuration info:
> The 3 Kafka brokers have one topic with 128 partitions and a replication factor of 1.
> The 3 Kafka Streams applications (running on 3 machines) have a single-processor topology, and the processor does nothing (the process() method just returns and the punctuate() method just commits).
> There is no data flowing yet, so the process() and punctuate() methods are not even called yet.
> The 3 Kafka Streams instances have 43, 43 and 42 threads respectively (128 threads in total, so one task per thread, distributed across the three streams instances on 3 machines).
> Here are the configurations that I played around with:
> session.timeout.ms=300000
> heartbeat.interval.ms=60000
> max.poll.records=100
> num.standby.replicas=1
> commit.interval.ms=10000
> poll.ms=100
> When punctuate is scheduled to be called every 1000 ms or 3000 ms, the problem happens every time. If punctuate is scheduled for 5000 ms, I didn't see the problem in my test scenario (described above), but it happened in my real application. But this may have nothing to do with the issue, since punctuate is not even called, as there are no messages streaming through yet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
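
For reference, below is a minimal sketch of the kind of setup the reporter describes: a no-op processor whose process() returns immediately and whose punctuate() only commits, wired up with the configuration values listed in the ticket, using the 0.10.2-era Processor API. The application id, bootstrap server address, and topic name are placeholders for illustration, not values taken from the ticket.

    import java.util.Properties;

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.processor.Processor;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.processor.TopologyBuilder;

    public class NoOpStreamsApp {

        // Processor that does nothing: process() just returns, punctuate() just commits.
        static class NoOpProcessor implements Processor<byte[], byte[]> {
            private ProcessorContext context;

            @Override
            public void init(ProcessorContext context) {
                this.context = context;
                context.schedule(5000L);   // punctuate interval; 1000/3000 ms reproduced the issue per the report
            }

            @Override
            public void process(byte[] key, byte[] value) { /* intentionally empty */ }

            @Override
            public void punctuate(long timestamp) {
                context.commit();          // commit only, no other work
            }

            @Override
            public void close() { }
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rtp-kstreams-metrics");  // placeholder id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");       // placeholder broker
            props.put(StreamsConfig.STATE_DIR_CONFIG, "/data/kafka-streams");
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 43);                  // 43/43/42 across 3 instances
            props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
            props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10000);
            props.put(StreamsConfig.POLL_MS_CONFIG, 100);
            props.put("session.timeout.ms", 300000);    // passed through to the embedded consumer
            props.put("heartbeat.interval.ms", 60000);
            props.put("max.poll.records", 100);
            props.put("auto.offset.reset", "latest");

            TopologyBuilder builder = new TopologyBuilder();
            builder.addSource("source", "input-topic");                 // placeholder topic name
            builder.addProcessor("no-op", NoOpProcessor::new, "source");

            new KafkaStreams(builder, props).start();
        }
    }

Running three instances of such an application, each with enough threads to cover all 128 partitions, matches the described scenario in which StateDirectory.cleanRemovedTasks() contends for task directory locks even though no records are flowing.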