kafka-dev mailing list archives

From "Neha Narkhede (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (KAFKA-1063) run log cleanup at startup
Date Wed, 06 Nov 2013 02:32:17 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede reassigned KAFKA-1063:

    Assignee: Neha Narkhede

> run log cleanup at startup
> --------------------------
>                 Key: KAFKA-1063
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1063
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8
>            Reporter: paul mackles
>            Assignee: Neha Narkhede
>            Priority: Minor
>             Fix For: 0.8.1
> Jun suggested I file this ticket to have the brokers run log cleanup at startup.
Here is the scenario that precipitated it:
> We ran into a situation on our dev cluster (3 nodes, v0.8) where we ran out of disk on
one of the nodes. As expected, the broker shut itself down and all of the clients switched
over to the other nodes. So far so good.
> To free up disk space, I reduced log.retention.hours to something more manageable (from
172 to 12). I did this on all 3 nodes. Since the other 2 nodes were running OK, I first tried
to restart the node which ran out of disk. Unfortunately, it kept shutting itself down due
to the full disk. From the logs, I think this was because it was trying to sync up the replicas
it was responsible for and, of course, couldn't due to the lack of disk space. My hope was that
upon restart, it would see the new retention settings and free up a bunch of disk space before
trying to do any syncs.
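The retention change described above boils down to one line in each broker's server.properties. A minimal sketch, applied to a throwaway demo file rather than a real config (the file name is an assumption; the starting value of 172 hours comes from the report):

```shell
# Sketch only: replicate the retention change from the report on a demo file.
# A real change would edit server.properties on each broker instead.
printf 'log.retention.hours=172\n' > server.properties.demo
sed -i 's/^log\.retention\.hours=.*/log.retention.hours=12/' server.properties.demo
cat server.properties.demo
```

As the report notes, a broker only sees the new value after a restart, which is exactly why the disk-full node could not benefit from it.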
> I then went and restarted the other 2 nodes. They both picked up the new retention settings
and freed up a bunch of storage as a result. I then went back and tried to restart the 3rd
node but to no avail. It still had problems with the full disks.
> I thought about trying to reassign partitions so that the node in question had less to
manage but that turned out to be a hassle so I wound up manually deleting some of the old
log/segment files. The broker seemed to come back fine after that but that's not something
I would want to do on a production server.
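The manual workaround above can be sketched roughly as follows, with made-up directory and segment names (the email does not give the actual paths); 720 minutes matches the new 12-hour retention, and the broker should be stopped first:

```shell
# Demo only: delete log segment files older than 12 hours (720 minutes).
# Directory layout and file names are illustrative, not from the report.
LOG_DIR=/tmp/kafka-logs-demo/mytopic-0
mkdir -p "$LOG_DIR"
touch -d '2 days ago' "$LOG_DIR/00000000000000000000.log"  # stale segment
touch "$LOG_DIR/00000000000000012345.log"                  # recent segment
find /tmp/kafka-logs-demo -name '*.log' -mmin +720 -delete
ls "$LOG_DIR"
```

This is a last resort, for the reason the reporter gives: deleting segment files out from under Kafka is not something to do on a production server.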
> We obviously need better monitoring/alerting to avoid this situation altogether, but
I am wondering if the order of operations at startup could/should be changed to better account
for scenarios like this. Or maybe a utility to remove old logs after changing ttl? Did I miss
a better way to handle this?
> Original email thread is here:
> http://mail-archives.apache.org/mod_mbox/kafka-users/201309.mbox/%3cCE6365AE.82D66%25pmackles@adobe.com%3e

This message was sent by Atlassian JIRA
