hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-8486) DN startup may cause severe data loss
Date Thu, 28 May 2015 18:11:26 GMT

     [ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated HDFS-8486:
    Attachment: HDFS-8486.patch

After multiple iterations, this is the simplest low-risk patch.  The crucial part is that the
BlockPoolSlice realizes it has discovered an on-disk block that has the same path as the one
in memory.  In that case it updates the replica map with the one just found.
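
As a rough illustration of that rule (not the code in the attached patch), the sketch below
treats two replicas with the same path as the same file and never deletes either.  The names
Replica, DuplicateResolver, resolve and addOrResolve are hypothetical, and the order of the
tie-breakers (generation stamp first, then length) is an assumption; the issue only says the
choice is based on "size, GS, etc."

```java
import java.util.Map;

/** Hypothetical, simplified replica record; not the real HDFS ReplicaInfo. */
final class Replica {
    final long blockId;
    final long genStamp;
    final long numBytes;
    final String path;

    Replica(long blockId, long genStamp, long numBytes, String path) {
        this.blockId = blockId;
        this.genStamp = genStamp;
        this.numBytes = numBytes;
        this.path = path;
    }
}

final class DuplicateResolver {
    /** Which replica stays in the map, and which file (if any) may be deleted. */
    static final class Result {
        final Replica keep;
        final Replica delete; // null means "delete nothing"

        Result(Replica keep, Replica delete) {
            this.keep = keep;
            this.delete = delete;
        }
    }

    /** Static and side-effect free, so it can be tested without a BlockPoolSlice. */
    static Result resolve(Replica inMemory, Replica onDisk) {
        if (inMemory.path.equals(onDisk.path)) {
            // Same file seen twice: keep the one just found, delete nothing.
            return new Result(onDisk, null);
        }
        Replica keep;
        if (inMemory.genStamp != onDisk.genStamp) {
            // Assumed ordering: the newer generation stamp wins.
            keep = inMemory.genStamp > onDisk.genStamp ? inMemory : onDisk;
        } else if (inMemory.numBytes != onDisk.numBytes) {
            // Assumed ordering: then the longer replica wins.
            keep = inMemory.numBytes > onDisk.numBytes ? inMemory : onDisk;
        } else {
            // Arbitrary tie-break for this sketch.
            keep = inMemory;
        }
        return new Result(keep, keep == inMemory ? onDisk : inMemory);
    }

    /** Applies resolve() to a replica map keyed by block id; returns the replica to delete, or null. */
    static Replica addOrResolve(Map<Long, Replica> replicaMap, Replica found) {
        Replica existing = replicaMap.get(found.blockId);
        if (existing == null) {
            replicaMap.put(found.blockId, found);
            return null;
        }
        Result r = resolve(existing, found);
        replicaMap.put(found.blockId, r.keep);
        return r.delete;
    }
}
```

With this shape, a map entry that already points at the same path is simply refreshed, so an
early directory scan can no longer trigger a deletion of the only copy.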

The other part is avoiding the race altogether.  The directory scan should not occur until
after the block pools are initialized.  Although both should be able to "work" simultaneously,
until the block pools have been initialized for the first time the directory scanner warns
that there is no block scanner for every new block it finds.
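
One way to picture "scan only after init" is a one-shot latch that the scanner waits on.
This is only a sketch under that assumption; ScanGate and its methods are hypothetical names
and do not describe how the patch or the DataNode actually orders startup.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical gate: the directory scan may run only after block pool init completes. */
final class ScanGate {
    private final CountDownLatch initialized = new CountDownLatch(1);

    /** Called once by the (hypothetical) block pool initialization path. */
    void markBlockPoolsInitialized() {
        initialized.countDown();
    }

    /**
     * Runs the scan if initialization finishes within the timeout.
     * Returns false, without scanning, on timeout or interruption.
     */
    boolean runScanWhenReady(Runnable scan, long timeoutMillis) {
        try {
            if (!initialized.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // still initializing: skip this scan
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        scan.run();
        return true;
    }
}
```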

Note I found writing a unit test to be extremely difficult.  The BlockPoolSlice ctor has numerous
side-effects.  I instead split out part of duplicate resolution into a static method (sigh,
makes future mocking impossible).

> DN startup may cause severe data loss
> -------------------------------------
>                 Key: HDFS-8486
>                 URL: https://issues.apache.org/jira/browse/HDFS-8486
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 0.23.1, 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Blocker
>         Attachments: HDFS-8486.patch
> A race condition between block pool initialization and the directory scanner may cause
> a mass deletion of blocks in multiple storages.
> If block pool initialization finds a block on disk that is already in the replica map,
> it deletes one of the blocks based on size, GS, etc.  Unfortunately it _always_ deletes one
> of the blocks even if identical, thus the replica map _must_ be empty when the pool is initialized.
> The directory scanner starts at a random time within its periodic interval (default 6h).
> If the scanner starts very early it races to populate the replica map, causing the block
> pool init to erroneously delete blocks.
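
To make the timing concrete: a start offset drawn uniformly from the scan interval can land
arbitrarily close to zero, which is what lets the scanner overlap block pool init.  The
sketch below assumes a uniform draw; ScanStartDelay and randomInitialDelayMs are hypothetical
names, and only the 6h default comes from the description above.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper showing a random first-scan offset within the interval. */
final class ScanStartDelay {
    /** Default interval from the issue description: 6 hours, in milliseconds. */
    static final long DEFAULT_INTERVAL_MS = 6L * 60 * 60 * 1000;

    /** Returns a delay in [0, intervalMs); intervalMs must be positive. */
    static long randomInitialDelayMs(long intervalMs) {
        return ThreadLocalRandom.current().nextLong(intervalMs);
    }
}
```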

This message was sent by Atlassian JIRA
