kafka-jira mailing list archives

From "Guozhang Wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-5829) Speedup broker startup after unclean shutdown by reducing unnecessary snapshot files deletion
Date Wed, 04 Oct 2017 23:14:01 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Guozhang Wang updated KAFKA-5829:
    Reviewer: Jason Gustafson  (was: Ismael Juma)

> Speedup broker startup after unclean shutdown by reducing unnecessary snapshot files deletion
> ---------------------------------------------------------------------------------------------
>                 Key: KAFKA-5829
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5829
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Dong Lin
>            Assignee: Dong Lin
>            Priority: Critical
>             Fix For: 1.0.0
> The current Kafka implementation causes slow broker startup after an unclean shutdown: the time to load a partition can be 10x or more what is actually needed. Here is an explanation with an example:
> - Say we have a partition of 20 segments, each with 250 messages, with offsets starting at 0, and each message is 1 MB.
> - The broker experiences a hard kill, and the index file of the first segment is corrupted.
> - When the broker starts up and loads the first segment, it finds that the index of the first segment is corrupted, so it calls `log.recoverSegment(...)` to recover this segment. This method calls `stateManager.truncateAndReload(...)`, which deletes the snapshot files whose offsets are larger than the base offset of the first segment. Thus all snapshot files are deleted.
> - To rebuild the snapshot files, `log.loadSegmentFiles(...)` has to read every message in this partition, even though the log and index files of the other segments are not corrupted. This increases the time to load the partition by more than an order of magnitude.
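
The arithmetic behind the example above can be sketched with a toy model. This is not Kafka code: `Segment`, `truncate_snapshots`, and `mb_read_on_load` are hypothetical names, and the model assumes one producer-state snapshot at the base offset of every segment after the first, and that loading replays the log from the latest surviving snapshot to the log end.

```python
from dataclasses import dataclass

# Numbers taken from the example in the issue description.
NUM_SEGMENTS = 20
MESSAGES_PER_SEGMENT = 250
MESSAGE_SIZE_MB = 1


@dataclass(frozen=True)
class Segment:
    base_offset: int

    @property
    def size_mb(self) -> int:
        return MESSAGES_PER_SEGMENT * MESSAGE_SIZE_MB


def truncate_snapshots(snapshot_offsets, offset):
    """Toy version of the deletion step: drop snapshots above `offset`."""
    return [o for o in snapshot_offsets if o <= offset]


def mb_read_on_load(segments, corrupted_base, snapshot_offsets):
    """MB scanned: the segment with the bad index, plus everything from the
    latest surviving snapshot (or the log start, if none) to the log end."""
    replay_from = max(snapshot_offsets, default=segments[0].base_offset)
    return sum(
        s.size_mb
        for s in segments
        if s.base_offset == corrupted_base or s.base_offset >= replay_from
    )


segments = [Segment(i * MESSAGES_PER_SEGMENT) for i in range(NUM_SEGMENTS)]
# Assumption: one snapshot at the base offset of every segment after the first.
snapshots = [s.base_offset for s in segments[1:]]

kept = mb_read_on_load(segments, 0, snapshots)
deleted = mb_read_on_load(segments, 0, truncate_snapshots(snapshots, 0))
print(f"snapshots kept: {kept} MB, snapshots deleted: {deleted} MB")
```

Under these assumptions the model reads 500 MB when the snapshots survive and 5000 MB when they are all deleted, a 10x difference. The exact ratio depends on how many snapshots a real broker retains.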
> In order to address this issue, one simple solution is not to delete snapshot files whose offsets are larger than the given offset if only the index file needs to be rebuilt. More specifically, we should not need to rebuild the producer state offset file unless the log file itself is corrupted or
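
The proposal above can be sketched as a small decision rule. This is only an illustration under stated assumptions, not the actual patch: `Damage` and `snapshots_to_keep` are hypothetical names, and the model distinguishes just two kinds of damage, to the index only or to the log file itself.

```python
from enum import Enum, auto


class Damage(Enum):
    INDEX_ONLY = auto()  # only the index file needs to be rebuilt
    LOG_FILE = auto()    # the log file itself is damaged


def snapshots_to_keep(snapshot_offsets, segment_base_offset, damage):
    """Return the snapshot offsets that survive recovery of one segment."""
    if damage is Damage.LOG_FILE:
        # Log data is suspect, so later producer state cannot be trusted.
        return [o for o in snapshot_offsets if o <= segment_base_offset]
    # Index-only damage: the log is intact, so every snapshot stays valid.
    return list(snapshot_offsets)


print(snapshots_to_keep([250, 500, 750], 0, Damage.INDEX_ONLY))
print(snapshots_to_keep([250, 500, 750], 0, Damage.LOG_FILE))
```

With index-only damage every snapshot is retained, so no extra log replay is needed to rebuild producer state.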

This message was sent by Atlassian JIRA