hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-4006) TestCheckpoint#testSecondaryHasVeryOutOfDateImage occasionally fails due to unexpected exit
Date Sat, 06 Oct 2012 00:36:02 GMT

     [ https://issues.apache.org/jira/browse/HDFS-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-4006:
------------------------------

    Attachment: hdfs-4006.txt

I think this patch will fix the issue.

The issue was the following:
In testCheckpointTriggerOnTxnCount we were setting up a thread to run the SNN's checkpoint
work loop, but never joining on it when the test completed. This caused a race: the
snn.close() call caused SecondaryNameNode.storage.close() to get called, which cleared
the list of storage directories, so the getFsImageName() call returned null if it
raced with the completion of a checkpoint. I was able to reproduce this reliably by adding
a sleep before the getFsImageName call, and then adding a join on the thread at the end of
the test.
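
To make the interleaving above concrete, here is a minimal sketch of the bad ordering. It is
not the Hadoop code: ToyStorage, getImageName and RaceSketch are hypothetical stand-ins, and
the ordering is forced sequentially rather than raced, so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the secondary's storage object; not the Hadoop class.
class ToyStorage {
    private final List<String> dirs = new ArrayList<>();

    ToyStorage(List<String> initialDirs) {
        dirs.addAll(initialDirs);
    }

    // Returns an image path under the first storage directory, or null if none remain.
    synchronized String getImageName() {
        return dirs.isEmpty() ? null : dirs.get(0) + "/fsimage";
    }

    // Closing clears the directory list, mirroring the behaviour described above.
    synchronized void close() {
        dirs.clear();
    }
}

class RaceSketch {
    // Forces the losing order: close() runs before the checkpointer asks for the name.
    static String nameAfterClose(List<String> storageDirs) {
        ToyStorage storage = new ToyStorage(storageDirs);
        storage.close();
        return storage.getImageName();
    }
}
```

Any caller that dereferences the returned value without a null check would hit an NPE in
this ordering, which is consistent with the backtrace quoted in the issue.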

The fix is to actually make the checkpointer thread a member of the SecondaryNameNode, so
that it can be properly shut down.
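
As a rough illustration of that idea (this is not the attached patch, and ToySecondary with
its method names is hypothetical), the owning object keeps a reference to the thread it
started, so shutdown can stop and join it before any shared state is released:

```java
// Hypothetical sketch of an object that owns its background worker thread.
class ToySecondary {
    private volatile boolean shouldRun = true;
    private Thread checkpointThread; // kept as a member so shutdown() can join it

    synchronized void startCheckpointThread() {
        if (checkpointThread != null) {
            throw new IllegalStateException("checkpoint thread already started");
        }
        checkpointThread = new Thread(this::doWork, "toy-checkpointer");
        checkpointThread.setDaemon(true);
        checkpointThread.start();
    }

    private void doWork() {
        while (shouldRun) {
            try {
                Thread.sleep(10); // stands in for waiting out a checkpoint period
            } catch (InterruptedException e) {
                // Woken by shutdown(); the loop condition decides whether to exit.
            }
        }
    }

    // Stops the worker and waits for it to exit before anything else is torn down.
    void shutdown() {
        shouldRun = false;
        Thread t;
        synchronized (this) {
            t = checkpointThread;
        }
        if (t != null) {
            t.interrupt();
            try {
                t.join(10000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            }
        }
        // Shared state (for example, storage) would be closed only after this point.
    }

    synchronized boolean isCheckpointThreadAlive() {
        return checkpointThread != null && checkpointThread.isAlive();
    }
}
```

The ordering is the point of the sketch: the join happens before any cleanup, so the worker
can no longer observe a half-closed object.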

I also added code to the test that checks for any leftover checkpointer threads between tests
as an extra safeguard against this kind of test bug.
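
One way such a check could look, as a sketch only: scan the live threads in the JVM for a
name fragment and fail the test if any match. The LeftoverThreads class and the name
fragment passed to it are assumptions for illustration, not what the patch uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists live threads whose name contains the given fragment.
class LeftoverThreads {
    static List<String> findLiveThreads(String nameFragment) {
        List<String> found = new ArrayList<>();
        // getAllStackTraces() returns a map keyed by every live thread in the JVM.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().contains(nameFragment)) {
                found.add(t.getName());
            }
        }
        return found;
    }
}
```

A test harness could call this from a per-test setup or teardown hook and fail when the
returned list is non-empty.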
                
> TestCheckpoint#testSecondaryHasVeryOutOfDateImage occasionally fails due to unexpected exit
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4006
>                 URL: https://issues.apache.org/jira/browse/HDFS-4006
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Todd Lipcon
>              Labels: test-fail
>         Attachments: hdfs-4006.txt, test-log.txt
>
>
> TestCheckpoint#testSecondaryHasVeryOutOfDateImage occasionally fails with an unexpected
> exit caused by an NPE while checkpointing. It looks like the background checkpoint fails,
> apparently conflicting with the explicit checkpoints done by the tests (note that the
> backtrace is not for the doCheckpoint calls in the tests).
> {noformat}
> 2012-09-16 01:55:05,901 FATAL hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1355)) - Test resulted in an unexpected exit
> org.apache.hadoop.util.ExitUtil$ExitException: Fatal exception with message null
> stack trace
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:480)
> at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:331)
> at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:298)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
> at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:294)
> at java.lang.Thread.run(Thread.java:662)
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
