kafka-jira mailing list archives

From "Ben Corlett (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-6194) Server crash while deleting segments
Date Tue, 14 Nov 2017 13:35:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251371#comment-16251371 ]

Ben Corlett commented on KAFKA-6194:
------------------------------------

In the past, topics have been deleted and recreated with the same name. Unfortunately this has resulted in "lingering" topics that had to be removed manually from ZooKeeper. These days we version the topic names and create new topics under new names instead, so a delete-and-recreate might still happen, but only very rarely.

In the last 24 hours we have had another 15 failures. I've sanitised the log file from one of them (renamed the topics) and attached it to the issue.

{code}
[2017-11-14 03:33:30,238] ERROR Uncaught exception in scheduled task 'delete-file' (kafka.utils.KafkaScheduler)
kafka.common.KafkaStorageException: Delete of log 00000000000000000000.log.deleted failed.
        at kafka.log.LogSegment.delete(LogSegment.scala:505)
        at kafka.log.Log.deleteSeg$1(Log.scala:1490)
        at kafka.log.Log.$anonfun$asyncDeleteSegment$2(Log.scala:1492)
        at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
        at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
{code}

I had a brief look through the code, and I can see what you mean about the exception only being thrown when the delete fails and the file still exists afterwards.
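
Roughly the condition as I read it (a paraphrase for discussion, not the actual kafka.log.LogSegment source):

{code}
// Paraphrase of the condition discussed above -- my reading of it, not the
// actual kafka.log.LogSegment code: the delete is only treated as fatal when
// delete() reports failure AND the file is still visible on disk afterwards.
object DeleteCheckSketch {
  def deleteSegmentFile(file: java.io.File): Unit = {
    val deleted = file.delete()
    if (!deleted && file.exists())
      throw new java.io.IOException(s"Delete of log ${file.getName} failed.")
  }
}
{code}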

As you can see from the logs, we have very aggressive rotation/deletion/compaction of 'topic1'. Probably far more aggressive than actually needed, but that's the devs for you.

I wonder if we have a sequence of two deletes followed by compaction finishing and a new .deleted file appearing, something like this (rough sketch after the list):

1. delete the file (succeeds)
2. delete the file again (fails, it's already gone)
3. compaction finishes and a new .deleted file with the same name is created
4. check whether the file exists... it does... throw the exception
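
A toy illustration of that window (an assumption about the ordering, not Kafka code; the object name is made up):

{code}
import java.io.File

// Toy sketch of the suspected interleaving above -- an assumption, not Kafka
// source. The point is that the delete() result and the exists() check are
// two separate filesystem calls, so a same-named file recreated in between
// can make step 4 fire even though the original segment is already gone.
object DeleteRaceSketch extends App {
  val f = new File("00000000000000000000.log.deleted")
  f.createNewFile()

  // 1. first scheduled delete succeeds
  assert(f.delete())

  // 2. second scheduled delete of the now-missing file reports failure
  val deleted = f.delete()

  // 3. compaction finishes and recreates a file with the same name,
  //    landing exactly in this window
  f.createNewFile()

  // 4. the non-atomic "did it really fail?" check now sees a file on disk
  //    and concludes the delete failed
  if (!deleted && f.exists())
    println(s"would throw: Delete of log ${f.getName} failed.")

  f.delete() // clean up the sketch's scratch file
}
{code}

In the real broker that window would have to be hit by timing, which might fit with us only seeing a handful of these per day across the cluster.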

Another thing: it always seems to be 00000000000000000000.log.deleted on all the different servers.

Maybe we could enable additional logging.
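
For example, something like this in the broker's config/log4j.properties (a guess at which loggers would actually help here; we haven't tried it yet):

{code}
# Turn up logging for the log/cleaner classes involved in segment deletion.
# DEBUG on the kafka.log package is a guess at what would be useful.
log4j.logger.kafka.log=DEBUG
log4j.logger.kafka.log.LogCleaner=DEBUG, cleanerAppender
{code}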

> Server crash while deleting segments
> ------------------------------------
>
>                 Key: KAFKA-6194
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6194
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.0.0
>         Environment: kafka version: 1.0
> client versions: 0.8.2.1-0.10.2.1
> platform: aws (eu-west-1a)
> nodes: 36 x r4.xlarge
> disk storage: 2.5 tb per node (~73% usage per node)
> topics: 250
> number of partitions: 48k (approx)
> os: ubuntu 14.04
> jvm: Java(TM) SE Runtime Environment (build 1.8.0_131-b11), Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
>            Reporter: Ben Corlett
>              Labels: regression
>         Attachments: server.log.2017-11-14-03.gz
>
>
> We upgraded our R+D cluster to 1.0 in the hope that it would fix the deadlock from 0.11.0.*. Sadly our cluster has the memory leak issues with 1.0, most likely from https://issues.apache.org/jira/browse/KAFKA-6185. We are running one server on a patched version of 1.0 with the pull request from that issue.
> However, today we have had two different servers fall over for non-heap-related reasons. The exceptions in the kafka log are:
> {code}
> [2017-11-09 15:32:04,037] ERROR Error while deleting segments for xxxxxxxxxx-49 in dir /mnt/secure/kafka/datalog (kafka.server.LogDirFailureChannel)
> java.io.IOException: Delete of log 00000000000000000000.log.deleted failed.
>         at kafka.log.LogSegment.delete(LogSegment.scala:496)
>         at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
>         at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>         at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
>         at kafka.log.Log.deleteSeg$1(Log.scala:1596)
>         at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
>         at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
>         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> [2017-11-09 15:32:04,040] INFO [ReplicaManager broker=122] Stopping serving replicas in dir /mnt/secure/kafka/datalog (kafka.server.ReplicaManager)
> [2017-11-09 15:32:04,041] ERROR Uncaught exception in scheduled task 'delete-file' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for xxxxxxxxxxxxxx-49 in dir /mnt/secure/kafka/datalog
> Caused by: java.io.IOException: Delete of log 00000000000000000000.log.deleted failed.
>         at kafka.log.LogSegment.delete(LogSegment.scala:496)
>         at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
>         at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>         at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
>         at kafka.log.Log.deleteSeg$1(Log.scala:1596)
>         at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
>         at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
>         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> .....
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition xxxxxxx-83 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.KafkaStorageException: Replica 122 is in an offline log directory for partition xxxxxxx-83
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition xxxxxxx-89 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.KafkaStorageException: Replica 122 is in an offline log directory for partition xxxxxxx-89
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition xxxxxxx-76 (kafka.server.ReplicaFetcherThread)
> .....
> [2017-11-09 15:32:05,613] WARN [ReplicaManager broker=122] While recording the replica LEO, the partition xxxxxxx-27 hasn't been created. (kafka.server.ReplicaManager)
> [2017-11-09 15:32:05,613] WARN [ReplicaManager broker=122] While recording the replica LEO, the partition xxxxxxxxx-79 hasn't been created. (kafka.server.ReplicaManager)
> [2017-11-09 15:32:05,622] FATAL Shutdown broker because all log dirs in /mnt/secure/kafka/datalog have failed (kafka.log.LogManager)
> {code}


