kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jay Kreps (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-600) kafka should respond gracefully rather than crash when unable to write due to ENOSPC
Date Sat, 03 Nov 2012 00:08:12 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489859#comment-13489859
] 

Jay Kreps commented on KAFKA-600:
---------------------------------

This seems sensible but the problem is this. If you get ENOSPC, you likely accepted a partial
write, which means the log itself is corrupt. To fix this we need to run recovery on the log
(which checks each message in the last segment and truncates off any invalid partitial writes).
This happens automatically when the server is restarted. Theoretically this could be done
automatically but since the recovery process can be slow (minutes) and we can't accept writes
during that time we felt the best course of action is to have the node shoot itself in the
head and let other healthy nodes take over.
                
> kafka should respond gracefully rather than crash when unable to write due to ENOSPC
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-600
>                 URL: https://issues.apache.org/jira/browse/KAFKA-600
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Tim Stilwell
>
> problem:
> user starts kafka with log.dir value set to a small partition and begins writing data
to the mq.  when the disk partition is full, kafka crashes.  given that this product is used
for both reading and writing operations, crashing seems rather drastic even if the error message
is helpful.   something more robust would be appreciated.  perhaps, logging an error and rejecting
additional write requests while accepting additional read requests?  perhaps, sending an email
alert to Operations?  at least shutdown gracefully so the user is aware that received messages
were saved with a helpful message providing some details of the last message received.  when
tens or hundreds of thousands of messages can be processed in a second, it isn't helpful to
merely log a timestamp and crash.
> steps to reproduce:
> 1) download and install kafka
> 2) modify server.properties
>     # vi /opt/kafka-0.7.2-incubating-src/config/server.properties
>     set log.dir="/var/log/kafka"
> 3) modify log4j
>     # vi /opt/kafka-0.7.2-incubating-src/config/log4j.properties
>     set fileAppender.File=/var/log/kafka/kafka-request.log
> 4) start kafka service
>     $ sudo bash
>     # ulimit -c unlimited
>     # /opt/kafka-0.7.2-incubating-src/bin/kafka-server-start.sh /opt/kafka-0.7.2-incubating-src/config/server.properties
&
> 6) begin writing data to hostname:9092
> 7) review /var/log/kafka-request.log
> results:
> $ grep log.dir /opt/kafka-0.7.2-incubating-src/config/server.properties
> log.dir=/var/log/kafka
> $ df -h /var/log/kafka
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda1       4.0G  4.0G     0 100% /
> $ tail /var/log/kafka/kafka-request.log
> 17627442 [ZkClient-EventThread-14-10.0.20.242:2181] INFO  kafka.server.KafkaZooKeeper
 - Begin registering broker topic /brokers/topics/raw/0 with 1 partitions
> 17627444 [ZkClient-EventThread-14-10.0.20.242:2181] INFO  kafka.server.KafkaZooKeeper
 - End registering broker topic /brokers/topics/raw/0
> 17627445 [ZkClient-EventThread-14-10.0.20.242:2181] INFO  kafka.server.KafkaZooKeeper
 - done re-registering broker
> 18337676 [kafka-processor-3] ERROR kafka.network.Processor  - Closing socket for /10.0.20.138
because of error
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:191)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
>         at kafka.utils.Utils$.read(Utils.scala:538)
>         at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Processor.read(SocketServer.scala:311)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)
> 18391974 [kafka-processor-4] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.138.
> 18422004 [kafka-processor-5] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.138.
> 18434563 [kafka-processor-6] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.138.
> 18485005 [kafka-processor-7] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.138.
> 18497083 [kafka-processor-0] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.138.
> 18525720 [kafka-processor-1] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.138.
> 18543843 [kafka-processor-2] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.138.
> 18563230 [kafka-processor-4] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.138.
> 18575613 [kafka-processor-5] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.138.
> 18677568 [kafka-processor-6] ERROR kafka.network.Processor  - Closing socket for /10.0.20.138
because of error
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:191)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
>         at kafka.utils.Utils$.read(Utils.scala:538)
>         at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Processor.read(SocketServer.scala:311)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)
> 18828016 [kafka-processor-7] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.248.
> 18844274 [kafka-processor-0] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.248.
> 18849691 [kafka-processor-1] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.248.
> 18896883 [kafka-processor-2] INFO  kafka.network.Processor  - Closing socket connection
to /10.0.20.248.
> 22383195 [kafka-processor-2] FATAL kafka.log.Log  - Halting due to unrecoverable I/O
error while handling producer request
> java.io.IOException: No space left on device
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>         at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:59)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
>         at sun.nio.ch.IOUtil.write(IOUtil.java:60)
>         at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:195)
>         at kafka.message.ByteBufferMessageSet.writeTo(ByteBufferMessageSet.scala:76)
>         at kafka.message.FileMessageSet.append(FileMessageSet.scala:159)
>         at kafka.log.LogSegment.append(Log.scala:105)
>         at kafka.log.Log.liftedTree1$1(Log.scala:246)
>         at kafka.log.Log.append(Log.scala:242)
>         at kafka.server.KafkaRequestHandlers.kafka$server$KafkaRequestHandlers$$handleProducerRequest(KafkaRequestHandlers.scala:69)
>         at kafka.server.KafkaRequestHandlers.handleProducerRequest(KafkaRequestHandlers.scala:53)
>         at kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$1.apply(KafkaRequestHandlers.scala:38)
>         at kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$1.apply(KafkaRequestHandlers.scala:38)
>         at kafka.network.Processor.handle(SocketServer.scala:296)
>         at kafka.network.Processor.read(SocketServer.scala:319)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message