kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bharad Tirumala (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-5038) running multiple kafka streams instances causes one or more instance to get into file contention
Date Fri, 07 Apr 2017 06:27:42 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960352#comment-15960352
] 

Bharad Tirumala commented on KAFKA-5038:
----------------------------------------

btw, this happens with as few as 5 threads per streams instance. The tasks get distributed
across all the 15 threads in the 3 instances before this file lock contention (mkdir in directoryForTask()
failing).

I've upload the test program with the relevant properties file and also the logs on the instance
when the issue happened.
(I've masked the actual server urls in the logs and in the properties file....)

test code and properties file at:
https://gist.github.com/btirumala/e996b2d09415f3ba42ad58dac4a64507

The debug logs of a failed instance at:
https://gist.github.com/btirumala/4516b3cd31271fab3eaf390a483f07fb


> running multiple kafka streams instances causes one or more instance to get into file
contention
> ------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5038
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5038
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.0
>         Environment: 3 Kafka broker machines and 3 kafka streams machines.
> Each machine is Linux 64 bit, CentOS 6.5 with 64GB memory, 8 vCPUs running in AWS
> 31GB java heap space allocated to each KafkaStreams instance and 4GB allocated to each
Kafka broker.
>            Reporter: Bharad Tirumala
>
> Having multiple kafka streams application instances causes one or more instances to get
get into file lock contention and the instance(s) become unresponsive with uncaught exception.
> The exception is below:
> 22:14:37.621 [StreamThread-7] WARN  o.a.k.s.p.internals.StreamThread - Unexpected state
transition from RUNNING to NOT_RUNNING
> 22:14:37.621 [StreamThread-13] WARN  o.a.k.s.p.internals.StreamThread - Unexpected state
transition from RUNNING to NOT_RUNNING
> 22:14:37.623 [StreamThread-18] WARN  o.a.k.s.p.internals.StreamThread - Unexpected state
transition from RUNNING to NOT_RUNNING
> 22:14:37.625 [StreamThread-7] ERROR n.a.a.k.t.KStreamTopologyBase - Uncaught Exception:org.apache.kafka.streams.errors.ProcessorStateException:
task directory [/data/kafka-streams/rtp-kstreams-metrics/0_119] doesn't exist and couldn't
be created
> 	at org.apache.kafka.streams.processor.internals.StateDirectory.directoryForTask(StateDirectory.java:75)
> 	at org.apache.kafka.streams.processor.internals.StateDirectory.lock(StateDirectory.java:102)
> 	at org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:205)
> 	at org.apache.kafka.streams.processor.internals.StreamThread.maybeClean(StreamThread.java:753)
> 	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:664)
> 	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)
> This happens within couple of minutes after the instances are up and there is NO data
being sent to the broker yet and the streams app is started with auto.offset.reset set to
"latest".
> Please note that there are no permissions or capacity issues. This may have nothing to
do with number of instances, but I could easily reproduce it when I've 3 stream instances
running. This is similar to the (and may be the same) bug as [KAFKA-3758]
> Here are some relevant configuration info:
> 3 kafka brokers have one topic with 128 partitions and 1 replication
> 3 kafka streams applications (running on 3 machines) have a single processor topology
and this processor is not doing anything (the process() method just returns and the punctuate
method just commits)
> There is no data flowing yet, so the process() and puctuate() methods are not even called
yet.
> The 3 kafka stream instances have 43, 43 and 42 threads each respectively (totally making
up to 128 threads, so one task per thread distributed across three streams instances on 3
machines).
> Here are the configurations that I'd played around with:
> session.timeout.ms=300000
> heartbeat.interval.ms=60000
> max.poll.records=100
> num.standby.replicas=1
> commit.interval.ms=10000
> poll.ms=100
> When punctuate is scheduled to be called every 1000ms or 3000ms, the problem happens
every time. If punctuate is scheduled for 5000ms, I didn't see the problem in my test scenario
(described above), but it happened in my real application. But this may have nothing to do
with the issue, since punctuate is not even called as there are no messages streaming through
yet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message