hadoop-hdfs-issues mailing list archives

From "Mingliang Liu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
Date Wed, 19 Oct 2016 01:16:58 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingliang Liu updated HDFS-11030:
---------------------------------
    Description: 
TestDataNodeVolumeFailure#testVolumeFailure fails a volume and then verifies that the blocks and files are replicated correctly.

To fail a volume, the test deletes all the blocks and sets the data dir read-only.
{code:title=testVolumeFailure()}
    // fail the volume
    // delete/make non-writable one of the directories (failed volume)
    data_fail = new File(dataDir, "data3");
    failedDir = MiniDFSCluster.getFinalizedDir(dataDir, 
        cluster.getNamesystem().getBlockPoolId());
    if (failedDir.exists() &&
        //!FileUtil.fullyDelete(failedDir)
        !deteteBlocks(failedDir)
        ) {
      throw new IOException("Could not delete hdfs directory '" + failedDir + "'");
    }
    data_fail.setReadOnly();
    failedDir.setReadOnly();
{code}
However, there are two bugs here:
- The {{failedDir}} directory for finalized blocks is not computed correctly: it should use {{data_fail}} instead of {{dataDir}} as the base directory.
- When deleting block files in {{deteteBlocks(failedDir)}}, the test assumes that there are no subdirectories in the data dir. This assumption is also stated in the comment:
{quote}
    // we use only small number of blocks to avoid creating subdirs in the data dir..
{quote}
This is not true: on both my local cluster and a MiniDFSCluster, the finalized directory contains two levels of subdirectories (subdir0/subdir0/) regardless of the number of blocks.
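For illustration, the finalized replica layout under the failed volume looks roughly like this on a MiniDFSCluster (the block pool ID and block IDs below are made-up examples; the relevant part is the {{subdir0/subdir0/}} nesting):
{code}
data3/current/BP-1234567890-127.0.0.1-1476000000000/current/finalized/
  subdir0/
    subdir0/
      blk_1073741825
      blk_1073741825_1001.meta
{code}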

Because of these two bugs, the blocks are never actually deleted.
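
A rough sketch of what the fixes could look like ({{deleteBlocksRecursively}} is a hypothetical replacement for {{deteteBlocks}}; {{MiniDFSCluster.getFinalizedDir}} is the same helper the test already calls):
{code}
    // Sketch of fix 1: compute the finalized dir from the failed volume
    // (data_fail) instead of from the parent dataDir.
    data_fail = new File(dataDir, "data3");
    failedDir = MiniDFSCluster.getFinalizedDir(data_fail,
        cluster.getNamesystem().getBlockPoolId());

    // Sketch of fix 2: delete block files recursively, because finalized
    // replicas live under subdir0/subdir0/, not directly in failedDir.
    if (failedDir.exists() && !deleteBlocksRecursively(failedDir)) {
      throw new IOException("Could not delete blocks under '" + failedDir + "'");
    }
    data_fail.setReadOnly();
    failedDir.setReadOnly();

  // Hypothetical helper replacing deteteBlocks(): delete every blk_* file
  // (block files and their .meta files) under dir, descending into subdirs.
  private boolean deleteBlocksRecursively(File dir) {
    File[] entries = dir.listFiles();
    if (entries == null) {
      return false;
    }
    boolean allDeleted = true;
    for (File f : entries) {
      if (f.isDirectory()) {
        allDeleted &= deleteBlocksRecursively(f);
      } else if (f.getName().startsWith("blk_")) {
        allDeleted &= f.delete();
      }
    }
    return allDeleted;
  }
{code}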

To fail a volume, the test also needs to trigger the DataNode to remove the volume and send a block report to the NameNode. This is done in the {{triggerFailure()}} method.
{code}
  private void triggerFailure(String path, long size) throws IOException {
    NamenodeProtocols nn = cluster.getNameNodeRpc();
    List<LocatedBlock> locatedBlocks =
      nn.getBlockLocations(path, 0, size).getLocatedBlocks();
    
    for (LocatedBlock lb : locatedBlocks) {
      DatanodeInfo dinfo = lb.getLocations()[1];
      ExtendedBlock b = lb.getBlock();
      try {
        accessBlock(dinfo, lb);
      } catch (IOException e) {
        System.out.println("Failure triggered, on block: " + b.getBlockId() +  
            "; corresponding volume should be removed by now");
        break;
      }
    }
  }
{code}
Accessing those blocks does not trigger any failure when the directory is merely read-only and the block files are still in place. I ran the test multiple times without this failure ever being triggered. To trigger it, we either have to write new block files to the data directories, or we have to delete the existing blocks correctly in the first place.
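
{{triggerFailure()}} also never verifies that a failure actually happened, so the test passes even when nothing is triggered. A possible hardening, sketched here by reusing the test's existing {{accessBlock()}} helper and a JUnit assertion:
{code}
    // Sketch: require that at least one block access actually fails,
    // instead of silently falling through when every accessBlock() succeeds.
    boolean failureTriggered = false;
    for (LocatedBlock lb : locatedBlocks) {
      try {
        accessBlock(lb.getLocations()[1], lb);
      } catch (IOException e) {
        failureTriggered = true;
        break;
      }
    }
    assertTrue("Accessing blocks on the failed volume did not trigger a failure",
        failureTriggered);
{code}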

This unit test has existed for years and seldom fails, simply because it never triggers a real volume failure.

> TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
> ---------------------------------------------------------------------
>
>                 Key: HDFS-11030
>                 URL: https://issues.apache.org/jira/browse/HDFS-11030
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, test
>    Affects Versions: 2.7.0
>            Reporter: Mingliang Liu
>            Assignee: Mingliang Liu
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


