Date: Sat, 3 Nov 2012 00:08:12 +0000 (UTC)
From: "Jay Kreps (JIRA)"
To: kafka-dev@incubator.apache.org
Subject: [jira] [Commented] (KAFKA-600) kafka should respond gracefully rather than crash when unable to write due to ENOSPC

    [ https://issues.apache.org/jira/browse/KAFKA-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489859#comment-13489859 ]

Jay Kreps commented on KAFKA-600:
---------------------------------

This seems sensible, but here is the problem: if you get ENOSPC, you have likely accepted a partial write, which means the log itself is corrupt. To fix this we need to run recovery on the log, which checks each message in the last segment and truncates any invalid partial writes. Recovery happens automatically when the server is restarted. In theory it could also be triggered without a restart, but since recovery can be slow (minutes) and we cannot accept writes during that time, we felt the best course of action was to have the node shoot itself in the head and let other healthy nodes take over.

> kafka should respond gracefully rather than crash when unable to write due to ENOSPC
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-600
>                 URL: https://issues.apache.org/jira/browse/KAFKA-600
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Tim Stilwell
>
> Problem:
> The user starts Kafka with log.dir pointing at a small partition and begins
> writing data to the message queue. When that partition fills up, Kafka
> crashes. Given that the broker handles both reads and writes, crashing seems
> rather drastic even if the error message is helpful; something more robust
> would be appreciated. Perhaps log an error and reject additional write
> requests while continuing to accept reads? Perhaps send an email alert to
> Operations? At the very least, shut down gracefully so the user knows that
> received messages were saved, with a helpful message giving some details of
> the last message received. When tens or hundreds of thousands of messages
> can be processed per second, it is not helpful to merely log a timestamp
> and crash.
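As a rough, hypothetical sketch of the graceful behavior suggested above (this is not Kafka code; DiskSpaceGuard, appendOrReject, and the 256 MB threshold are invented for illustration), a broker could check free space on the log.dir partition before appending and fail only the produce request:

    import java.io.File

    object DiskSpaceGuard {
      // log.dir value taken from the server.properties used in this report
      val logDir = new File("/var/log/kafka")
      // arbitrary threshold picked for illustration only
      val minFreeBytes: Long = 256L * 1024 * 1024

      // `doAppend` is a hypothetical stand-in for the real segment append; it
      // is not a Kafka API. When space is low, reject the write instead of
      // crashing, so reads can continue to be served.
      def appendOrReject(doAppend: () => Unit): Either[String, Unit] =
        if (logDir.getUsableSpace < minFreeBytes)
          Left("log.dir partition nearly full; rejecting produce request and alerting Operations")
        else
          Right(doAppend())
    }

A real broker would also have to handle the race where the disk fills up between the check and the write, which is part of why the comment above argues for failing fast and relying on recovery at restart.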
> Steps to reproduce:
> 1) Download and install Kafka.
> 2) Modify server.properties:
>      # vi /opt/kafka-0.7.2-incubating-src/config/server.properties
>      set log.dir=/var/log/kafka
> 3) Modify the log4j configuration:
>      # vi /opt/kafka-0.7.2-incubating-src/config/log4j.properties
>      set fileAppender.File=/var/log/kafka/kafka-request.log
> 4) Start the Kafka service:
>      $ sudo bash
>      # ulimit -c unlimited
>      # /opt/kafka-0.7.2-incubating-src/bin/kafka-server-start.sh /opt/kafka-0.7.2-incubating-src/config/server.properties &
> 5) Begin writing data to hostname:9092.
> 6) Review /var/log/kafka/kafka-request.log.
>
> Results:
>
> $ grep log.dir /opt/kafka-0.7.2-incubating-src/config/server.properties
> log.dir=/var/log/kafka
>
> $ df -h /var/log/kafka
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda1       4.0G  4.0G     0 100% /
>
> $ tail /var/log/kafka/kafka-request.log
> 17627442 [ZkClient-EventThread-14-10.0.20.242:2181] INFO kafka.server.KafkaZooKeeper - Begin registering broker topic /brokers/topics/raw/0 with 1 partitions
> 17627444 [ZkClient-EventThread-14-10.0.20.242:2181] INFO kafka.server.KafkaZooKeeper - End registering broker topic /brokers/topics/raw/0
> 17627445 [ZkClient-EventThread-14-10.0.20.242:2181] INFO kafka.server.KafkaZooKeeper - done re-registering broker
> 18337676 [kafka-processor-3] ERROR kafka.network.Processor - Closing socket for /10.0.20.138 because of error
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:191)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
>         at kafka.utils.Utils$.read(Utils.scala:538)
>         at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Processor.read(SocketServer.scala:311)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)
> 18391974 [kafka-processor-4] INFO kafka.network.Processor - Closing socket connection to /10.0.20.138.
> 18422004 [kafka-processor-5] INFO kafka.network.Processor - Closing socket connection to /10.0.20.138.
> 18434563 [kafka-processor-6] INFO kafka.network.Processor - Closing socket connection to /10.0.20.138.
> 18485005 [kafka-processor-7] INFO kafka.network.Processor - Closing socket connection to /10.0.20.138.
> 18497083 [kafka-processor-0] INFO kafka.network.Processor - Closing socket connection to /10.0.20.138.
> 18525720 [kafka-processor-1] INFO kafka.network.Processor - Closing socket connection to /10.0.20.138.
> 18543843 [kafka-processor-2] INFO kafka.network.Processor - Closing socket connection to /10.0.20.138.
> 18563230 [kafka-processor-4] INFO kafka.network.Processor - Closing socket connection to /10.0.20.138.
> 18575613 [kafka-processor-5] INFO kafka.network.Processor - Closing socket connection to /10.0.20.138.
> 18677568 [kafka-processor-6] ERROR kafka.network.Processor - Closing socket for /10.0.20.138 because of error
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:191)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
>         at kafka.utils.Utils$.read(Utils.scala:538)
>         at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Processor.read(SocketServer.scala:311)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)
> 18828016 [kafka-processor-7] INFO kafka.network.Processor - Closing socket connection to /10.0.20.248.
> 18844274 [kafka-processor-0] INFO kafka.network.Processor - Closing socket connection to /10.0.20.248.
> 18849691 [kafka-processor-1] INFO kafka.network.Processor - Closing socket connection to /10.0.20.248.
> 18896883 [kafka-processor-2] INFO kafka.network.Processor - Closing socket connection to /10.0.20.248.
> 22383195 [kafka-processor-2] FATAL kafka.log.Log - Halting due to unrecoverable I/O error while handling producer request
> java.io.IOException: No space left on device
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>         at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:59)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
>         at sun.nio.ch.IOUtil.write(IOUtil.java:60)
>         at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:195)
>         at kafka.message.ByteBufferMessageSet.writeTo(ByteBufferMessageSet.scala:76)
>         at kafka.message.FileMessageSet.append(FileMessageSet.scala:159)
>         at kafka.log.LogSegment.append(Log.scala:105)
>         at kafka.log.Log.liftedTree1$1(Log.scala:246)
>         at kafka.log.Log.append(Log.scala:242)
>         at kafka.server.KafkaRequestHandlers.kafka$server$KafkaRequestHandlers$$handleProducerRequest(KafkaRequestHandlers.scala:69)
>         at kafka.server.KafkaRequestHandlers.handleProducerRequest(KafkaRequestHandlers.scala:53)
>         at kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$1.apply(KafkaRequestHandlers.scala:38)
>         at kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$1.apply(KafkaRequestHandlers.scala:38)
>         at kafka.network.Processor.handle(SocketServer.scala:296)
>         at kafka.network.Processor.read(SocketServer.scala:319)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)
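The FATAL entry above is the fail-fast path described in the comment at the top of this message: once a write may have left a partial message at the end of a segment, the broker halts so that log recovery can truncate the invalid tail at the next startup. A minimal sketch of that idea, assuming a hypothetical appendToSegment stand-in rather than the real kafka.log.Log code:

    import java.io.IOException

    object FailFastLog {
      // `appendToSegment` is a hypothetical stand-in for the real segment append.
      def append(appendToSegment: () => Unit): Unit =
        try appendToSegment()
        catch {
          case e: IOException =>
            // The failed write may have left a partial message in the segment,
            // so halt immediately; recovery truncates the invalid tail when
            // the broker is restarted.
            System.err.println("FATAL Halting due to unrecoverable I/O error: " + e.getMessage)
            Runtime.getRuntime.halt(1)
        }
    }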