hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9631) Restarting namenode after deleting a directory with snapshot will fail
Date Fri, 08 Jan 2016 07:07:39 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Wei-Chiu Chuang updated HDFS-9631:
----------------------------------
    Description: 
I found that a number of {{TestOpenFilesWithSnapshot}} tests fail quite frequently. 
These tests ({{testParentDirWithUCFileDeleteWithSnapshot}}, {{testOpenFilesWithRename}}, {{testWithCheckpoint}})
are unable to reconnect to the namenode after a restart. The reconnection appears to fail
due to an EOFException between the datanode and the namenode.
{noformat}
FAILED:  org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot

Error Message:
Timed out waiting for Mini HDFS Cluster to start

Stack Trace:
java.io.IOException: Timed out waiting for Mini HDFS Cluster to start
	at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1345)
	at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2024)
	at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1985)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot(TestOpenFilesWithSnapshot.java:82)
{noformat}

It appears that these three tests all call doWriteAndAbort(), which creates files, aborts
their output streams, takes a snapshot of the parent directory, and then deletes the parent directory.


Interestingly, if the parent directory does not have a snapshot, the tests will not fail.

The following test will fail intermittently:
{code:java}
public void testDeleteParentDirWithSnapShot() throws Exception {
    // Create a snapshottable root directory.
    Path path = new Path("/test");
    fs.mkdirs(path);
    fs.allowSnapshot(path);

    // Write 2 MB to a file under /test/test, then abort the stream so the
    // file is left under construction.
    Path file = new Path("/test/test/test2");
    FSDataOutputStream out = fs.create(file);
    for (int i = 0; i < 2; i++) {
      long count = 0;
      while (count < 1048576) {
        out.writeBytes("hell");
        count += 4;
      }
    }
    ((DFSOutputStream) out.getWrappedStream()).hsync(EnumSet
        .of(SyncFlag.UPDATE_LENGTH));
    DFSTestUtil.abortStream((DFSOutputStream) out.getWrappedStream());

    // Repeat for a second under-construction file in the same directory.
    Path file2 = new Path("/test/test/test3");
    FSDataOutputStream out2 = fs.create(file2);
    for (int i = 0; i < 2; i++) {
      long count = 0;
      while (count < 1048576) {
        out2.writeBytes("hell");
        count += 4;
      }
    }
    ((DFSOutputStream) out2.getWrappedStream()).hsync(EnumSet
        .of(SyncFlag.UPDATE_LENGTH));
    DFSTestUtil.abortStream((DFSOutputStream) out2.getWrappedStream());

    // Take a snapshot, then delete the parent directory that still holds
    // the two aborted files, and restart the namenode.
    fs.createSnapshot(path, "s1");
    fs.delete(new Path("/test/test"), true);
    cluster.restartNameNode();
  }
{code}
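As a side note, each write loop in the test above appends the 4-byte string "hell" until the counter reaches 1 MiB, so each aborted file holds exactly 2 MiB of data at the time of the abort. A minimal standalone sketch of that byte accounting (the class and method names here are hypothetical, not part of the test):

```java
// Mirrors the write loops in the test above: a 4-byte token is appended
// until at least 1 MiB has been counted, and the loop runs twice per file.
public class WriteLoopMath {
    static long bytesWritten() {
        long total = 0;
        for (int i = 0; i < 2; i++) {
            long count = 0;
            while (count < 1048576) {
                count += 4; // "hell" is 4 bytes
            }
            total += count; // 1048576 is divisible by 4, so count lands exactly on 1 MiB
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(bytesWritten()); // 2097152, i.e. 2 MiB per file
    }
}
```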

I am not sure whether this is a test-case issue or a bug in snapshot handling.

  was:
I found a number of TestOpenFilesWithSnapshot tests failed quite frequently. 
These tests (testParentDirWithUCFileDeleteWithSnapshot, testOpenFilesWithRename, testWithCheckpoint)
are unable to reconnect to the namenode after restart. It looks like the reconnection failed
due to an EOFException between data node and the name node.

It appears that these three tests all call doWriteAndAbort(), which creates files and then
abort, and then set the parent directory with a snapshot, and then delete the parent directory.


Interestingly, if the parent directory does not have a snapshot, the tests will not fail.

The following test will fail intermittently:
{code:java}
public void testDeleteParentDirWithSnapShot() throws Exception {
    Path path = new Path("/test");
    fs.mkdirs(path);
    fs.allowSnapshot(path);
    Path file = new Path("/test/test/test2");
    FSDataOutputStream out = fs.create(file);
    for (int i = 0; i < 2; i++) {
      long count = 0;
      while (count < 1048576) {
        out.writeBytes("hell");
        count += 4;
      }
    }
    ((DFSOutputStream) out.getWrappedStream()).hsync(EnumSet
        .of(SyncFlag.UPDATE_LENGTH));
    DFSTestUtil.abortStream((DFSOutputStream) out.getWrappedStream());

    Path file2 = new Path("/test/test/test3");
    FSDataOutputStream out2 = fs.create(file2);
    for (int i = 0; i < 2; i++) {
      long count = 0;
      while (count < 1048576) {
        out2.writeBytes("hell");
        count += 4;
      }
    }
    ((DFSOutputStream) out2.getWrappedStream()).hsync(EnumSet
        .of(SyncFlag.UPDATE_LENGTH));
    DFSTestUtil.abortStream((DFSOutputStream) out2.getWrappedStream());

    fs.createSnapshot(path, "s1");
    // delete parent directory
    fs.delete(new Path("/test/test"), true);
    cluster.restartNameNode();
  }
{code}

I am not sure if it's a test case issue, or something to do with snapshots.


> Restarting namenode after deleting a directory with snapshot will fail
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9631
>                 URL: https://issues.apache.org/jira/browse/HDFS-9631
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
