Date: Fri, 10 Nov 2017 09:14:00 +0000 (UTC)
From: "Ben Corlett (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Comment Edited] (KAFKA-6194) Server crash while deleting segments

    [ https://issues.apache.org/jira/browse/KAFKA-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247242#comment-16247242 ]

Ben Corlett edited comment on KAFKA-6194 at 11/10/17 9:13 AM:
--------------------------------------------------------------

So we had another 8 incidents overnight. Maybe this is an issue of my own making. Another change was made shortly after upgrading to 1.0 to debug the heap issues (basically getting jmap to run properly).

In our monit config this was changed from:
{code}
start program = "/opt/kafka/bin/start.sh" as uid kafka with timeout 600 seconds
{code}
to:
{code}
start program = "/opt/kafka/bin/start.sh" as uid kafka and gid kafka with timeout 600 seconds
{code}
Previously files would be written with:
{code}
-rw-r--r-- 2 kafka root
{code}
and after the change:
{code}
-rw-r--r-- 1 kafka kafka
{code}
As the uid still has full access to all the files and directories, I assumed this wouldn't be an issue. Maybe something is odd about the way files are getting removed. I'm going to recursively chgrp the remaining files and see if we still get the issue.
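For reference, a minimal sketch of that recursive group fix, assuming the /mnt/secure/kafka/datalog path from the stack traces below and a target group of kafka (the object name FixLogDirGroup is made up here); running `chgrp -R kafka /mnt/secure/kafka/datalog` with sufficient privileges would have the same effect:

{code}
// Hypothetical one-off fixer: walk the (assumed) log dir and reset the owning group.
// Must be run as root or as the file owner who is a member of the "kafka" group.
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFileAttributeView
import scala.collection.JavaConverters._

object FixLogDirGroup {
  def main(args: Array[String]): Unit = {
    val root = Paths.get("/mnt/secure/kafka/datalog")   // assumed log dir
    val kafkaGroup = root.getFileSystem
      .getUserPrincipalLookupService
      .lookupPrincipalByGroupName("kafka")               // assumed target group

    // Visit every file and directory under the log dir and change the group
    // only where it is not already "kafka".
    Files.walk(root).iterator().asScala.foreach { path =>
      val view = Files.getFileAttributeView(path, classOf[PosixFileAttributeView])
      if (view != null && view.readAttributes().group().getName != "kafka")
        view.setGroup(kafkaGroup)
    }
  }
}
{code}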
> Server crash while deleting segments
> ------------------------------------
>
> Key: KAFKA-6194
> URL: https://issues.apache.org/jira/browse/KAFKA-6194
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 1.0.0
> Environment: kafka version: 1.0
>   client versions: 0.8.2.1-0.10.2.1
>   platform: aws (eu-west-1a)
>   nodes: 36 x r4.xlarge
>   disk storage: 2.5 tb per node (~73% usage per node)
>   topics: 250
>   number of partitions: 48k (approx)
>   os: ubuntu 14.04
>   jvm: Java(TM) SE Runtime Environment (build 1.8.0_131-b11), Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
> Reporter: Ben Corlett
> Labels: regression
>
> We upgraded our R+D cluster to 1.0 in the hope that it would fix the deadlock from 0.11.0.*. Sadly our cluster has the memory leak issues with 1.0, most likely from https://issues.apache.org/jira/browse/KAFKA-6185. We are running one server on a patched version of 1.0 with the pull request from that.
> However today we have had two different servers fall over for non-heap-related reasons. The exceptions in the kafka log are:
> {code}
> [2017-11-09 15:32:04,037] ERROR Error while deleting segments for xxxxxxxxxx-49 in dir /mnt/secure/kafka/datalog (kafka.server.LogDirFailureChannel)
> java.io.IOException: Delete of log 00000000000000000000.log.deleted failed.
>     at kafka.log.LogSegment.delete(LogSegment.scala:496)
>     at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
>     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>     at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
>     at kafka.log.Log.deleteSeg$1(Log.scala:1596)
>     at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
>     at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
>     at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> [2017-11-09 15:32:04,040] INFO [ReplicaManager broker=122] Stopping serving replicas in dir /mnt/secure/kafka/datalog (kafka.server.ReplicaManager)
> [2017-11-09 15:32:04,041] ERROR Uncaught exception in scheduled task 'delete-file' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for xxxxxxxxxxxxxx-49 in dir /mnt/secure/kafka/datalog
> Caused by: java.io.IOException: Delete of log 00000000000000000000.log.deleted failed.
>     at kafka.log.LogSegment.delete(LogSegment.scala:496)
>     at kafka.log.Log.$anonfun$asyncDeleteSegment$3(Log.scala:1596)
>     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>     at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
>     at kafka.log.Log.deleteSeg$1(Log.scala:1596)
>     at kafka.log.Log.$anonfun$asyncDeleteSegment$4(Log.scala:1599)
>     at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:110)
>     at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> .....
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition xxxxxxx-83 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.KafkaStorageException: Replica 122 is in an offline log directory for partition xxxxxxx-83
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition xxxxxxx-89 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.KafkaStorageException: Replica 122 is in an offline log directory for partition xxxxxxx-89
> [2017-11-09 15:32:05,341] ERROR Error while processing data for partition xxxxxxx-76 (kafka.server.ReplicaFetcherThread)
> .....
> [2017-11-09 15:32:05,613] WARN [ReplicaManager broker=122] While recording the replica LEO, the partition xxxxxxx-27 hasn't been created. (kafka.server.ReplicaManager)
> [2017-11-09 15:32:05,613] WARN [ReplicaManager broker=122] While recording the replica LEO, the partition xxxxxxxxx-79 hasn't been created. (kafka.server.ReplicaManager)
> [2017-11-09 15:32:05,622] FATAL Shutdown broker because all log dirs in /mnt/secure/kafka/datalog have failed (kafka.log.LogManager)
> {code}
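The failure chain in that log is: the underlying java.io.File.delete() reports failure by returning false rather than throwing, the segment-deletion path apparently promotes that false return to the "Delete of log ... failed." IOException, and under 1.0's log-dir failure handling that IOException takes the whole /mnt/secure/kafka/datalog directory offline; once every configured log dir has failed, the broker shuts itself down (the FATAL line above). A paraphrased sketch of that boolean-to-exception check, not the actual Kafka source (the object and method names here are made up):

{code}
// Paraphrased sketch only -- illustrates why a failed unlink surfaces as an
// IOException: File.delete() signals failure with a boolean, so the caller has
// to turn a `false` result into an exception itself.
import java.io.{File, IOException}

object SegmentDelete {
  def deleteSegmentFile(file: File): Unit = {
    val deleted = file.delete()          // returns false on failure, does not throw
    if (!deleted && file.exists())
      throw new IOException(s"Delete of log ${file.getName} failed.")
  }
}
{code}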
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)