Mailing-List: contact dev-help@kafka.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@kafka.apache.org
Date: Fri, 2 Jun 2017 21:29:04 +0000 (UTC)
From: "Kevin Chen (JIRA)" <jira@apache.org>
To: dev@kafka.apache.org
Message-ID: <JIRA.13064015.1492154856000.356587.1496438944175@Atlassian.JIRA>
In-Reply-To: <JIRA.13064015.1492154856000@Atlassian.JIRA>
References: <JIRA.13064015.1492154856000@Atlassian.JIRA> <JIRA.13064015.1492154856303@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (KAFKA-5070)
 org.apache.kafka.streams.errors.LockException: task [0_18] Failed to lock
 the state directory: /opt/rocksdb/pulse10/0_18
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 02 Jun 2017 21:29:09 -0000


    [ https://issues.apache.org/jira/browse/KAFKA-5070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035452#comment-16035452 ] 

Kevin Chen edited comment on KAFKA-5070 at 6/2/17 9:28 PM:
-----------------------------------------------------------

we saw the similar exceptions in our stream applications too, we are using 10.2 client/server. We have 6 threads per instance. we have 12 partitions on the incoming topic. so total 12 tasks, we are running total 2 instances on 2 nodes(one instance each). 

 We got the following exceptions in different situations
   ! org.apache.kafka.streams.errors.LockException: task [0_1] Failed to lock the state directory: 

at one time, the root cause is rocskdb.flush() exception(we are not using custom implementation), restart fix the problem. only saw that once.

But most time it happens when to re-balance, like we need bring down a node. In this case, no apparent root cause, and restart not always solve it. here is all I got from the stack trace
WARN  [2017-06-02 21:23:47,591] org.apache.kafka.streams.processor.internals.StreamThread: Could not create task 0_6. Will retry.
! org.apache.kafka.streams.errors.LockException: task [0_6] Failed to lock the state directory: /tmp/aggregator-one/kafka-streams-state/aggregator-one-1/0_6
! at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:102)
! at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:73)
! at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:108)
! at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:834)
! at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:1207)
! at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:1180)
! at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:937)
! at org.apache.kafka.streams.processor.internals.StreamThread.access$500(StreamThread.java:69)
! at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:236)
! at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:255)
! at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:339)
! at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:303)
! at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
! at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1030)
! at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
! at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:582)
! at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)


was (Author: kchen):
we saw the similar exceptions in our stream applications too, we are using 10.2 client/server. We have 6 threads per instance. we have 12 partitions on the incoming topic. so total 12 tasks, we are running total 2 instances on 2 nodes(one instance each). 

 We got the following exceptions in different situations
   ! org.apache.kafka.streams.errors.LockException: task [0_1] Failed to lock the state directory: 

at one time, the root cause is rocskdb.flush() exception(we are not using custom implementation), restart fix the problem. only saw that once.

But most time it happens when to re-balance, like we need bring down a node. In this case, no apparent root cause, here is all I got from the stack trace
WARN  [2017-06-02 21:23:47,591] org.apache.kafka.streams.processor.internals.StreamThread: Could not create task 0_6. Will retry.
! org.apache.kafka.streams.errors.LockException: task [0_6] Failed to lock the state directory: /tmp/aggregator-one/kafka-streams-state/aggregator-one-1/0_6
! at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:102)
! at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:73)
! at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:108)
! at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:834)
! at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:1207)
! at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:1180)
! at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:937)
! at org.apache.kafka.streams.processor.internals.StreamThread.access$500(StreamThread.java:69)
! at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:236)
! at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:255)
! at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:339)
! at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:303)
! at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
! at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1030)
! at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
! at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:582)
! at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)

> org.apache.kafka.streams.errors.LockException: task [0_18] Failed to lock the state directory: /opt/rocksdb/pulse10/0_18
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5070
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5070
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.0
>         Environment: Linux Version
>            Reporter: Dhana
>            Assignee: Matthias J. Sax
>         Attachments: RocksDB_LockStateDirec.7z
>
>
> Notes: we run two instance of consumer in two difference machines/nodes.
> we have 400 partitions. 200  stream threads/consumer, with 2 consumer.
> We perform HA test(on rebalance - shutdown of one of the consumer/broker), we see this happening
> Error:
> 2017-04-05 11:36:09.352 WARN  StreamThread:1184 StreamThread-66 - Could not create task 0_115. Will retry.
> org.apache.kafka.streams.errors.LockException: task [0_115] Failed to lock the state directory: /opt/rocksdb/pulse10/0_115
> 	at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:102)
> 	at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:73)
> 	at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:108)
> 	at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:834)
> 	at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:1207)
> 	at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:1180)
> 	at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:937)
> 	at org.apache.kafka.streams.processor.internals.StreamThread.access$500(StreamThread.java:69)
> 	at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:236)
> 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:255)
> 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:339)
> 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:303)
> 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
> 	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1030)
> 	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
> 	at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:582)
> 	at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)