hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8486) DN startup may cause severe data loss
Date Thu, 28 May 2015 16:22:17 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563196#comment-14563196 ]

Daryn Sharp commented on HDFS-8486:

What you'll notice is a spike in corrupt blocks that tapers down.  What's going on is that
the DN's block report included all the blocks it had deleted.  Over the next 6 hours, the slice
scanner slowly detects the missing blocks and reports them as corrupt.  After 6 hours, the
directory scanner detects and mass-removes all the remaining missing blocks.

In the 6-hour window, the NN does not know the blocks are under-replicated and it continues
to send clients to the DN.  Will file a separate bug for the DN not informing the NN when
it's missing a block it thought it had.
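
To make the window concrete, here is a rough sketch (Python, not DataNode or NameNode code) of how a block report that still lists deleted blocks leaves the NN with locations it should not hand out. All names (`nn_view`, `stale_locations`, the `dn1`..`dn3` and `blk_*` identifiers) are made up for illustration, and the NN's block map is reduced to a plain dict.

```python
def nn_view(block_reports):
    """Build the NN's block -> set-of-DNs map from each DN's block report."""
    locations = {}
    for dn, blocks in block_reports.items():
        for blk in blocks:
            locations.setdefault(blk, set()).add(dn)
    return locations


def stale_locations(locations, on_disk):
    """Return block -> DNs that the NN believes hold the block but do not."""
    stale = {}
    for blk, dns in locations.items():
        missing = {dn for dn in dns if blk not in on_disk[dn]}
        if missing:
            stale[blk] = missing
    return stale


# Hypothetical cluster: dn1 deleted its blocks at startup but still reported them.
reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1", "blk_2"},
    "dn3": {"blk_1", "blk_2"},
}
on_disk = {
    "dn1": set(),
    "dn2": {"blk_1", "blk_2"},
    "dn3": {"blk_1", "blk_2"},
}

locations = nn_view(reports)
stale = stale_locations(locations, on_disk)
```

In this toy setup the NN counts three replicas per block, so nothing looks under-replicated, while `stale` shows that `dn1` is still a location clients can be sent to for both blocks.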

> DN startup may cause severe data loss
> -------------------------------------
>                 Key: HDFS-8486
>                 URL: https://issues.apache.org/jira/browse/HDFS-8486
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 0.23.1, 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Blocker
> A race condition between block pool initialization and the directory scanner may cause
> a mass deletion of blocks in multiple storages.
> If block pool initialization finds a block on disk that is already in the replica map,
> it deletes one of the blocks based on size, GS, etc.  Unfortunately it _always_ deletes one
> of the blocks even if identical, thus the replica map _must_ be empty when the pool is initialized.
> The directory scanner starts at a random time within its periodic interval (default 6h).
> If the scanner starts very early it races to populate the replica map, causing the block
> pool init to erroneously delete blocks.
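
The race in the description can be sketched roughly as follows. This is a simplified Python model, not the DataNode's Java code: `Replica`, `resolve_duplicate`, `init_block_pool` and `run` are hypothetical names, the disk is reduced to a dict keyed by path, and the `guard_same_file` flag is only one conceivable guard, not necessarily the fix that was committed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    block_id: int
    path: str
    gen_stamp: int
    length: int


def resolve_duplicate(known, found):
    """Return (keep, discard): higher GS wins, then longer length; ties keep `known`."""
    if (known.gen_stamp, known.length) >= (found.gen_stamp, found.length):
        return known, found
    return found, known


def init_block_pool(disk, replica_map, guard_same_file=False):
    """Walk the disk; when a block is already in the map, delete the losing file."""
    for found in sorted(disk.values(), key=lambda r: r.path):
        known = replica_map.get(found.block_id)
        if known is None:
            replica_map[found.block_id] = found
            continue
        if guard_same_file and known.path == found.path:
            continue  # assumed guard: same file seen twice, nothing to resolve
        keep, discard = resolve_duplicate(known, found)
        # Always deletes the discarded replica's file, even when it is the
        # very same file as the one being kept.
        disk.pop(discard.path, None)
        replica_map[found.block_id] = keep


def run(scanner_first, guard_same_file=False):
    """Return how many block files survive startup."""
    disk = {
        f"/data/blk_{i}": Replica(i, f"/data/blk_{i}", 1001, 4096)
        for i in range(3)
    }
    replica_map = {}
    if scanner_first:
        # The directory scanner wins the race and populates the map first.
        for replica in disk.values():
            replica_map[replica.block_id] = replica
    init_block_pool(disk, replica_map, guard_same_file)
    return len(disk)
```

With an empty replica map, `run(False)` leaves all 3 files in place. If the scanner populates the map first, `run(True)` leaves 0, because every "duplicate" is the block's only copy. With the assumed guard, `run(True, guard_same_file=True)` again leaves 3.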

This message was sent by Atlassian JIRA
