hadoop-hdfs-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11030) TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
Date Thu, 20 Oct 2016 23:33:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593363#comment-15593363 ]

Hadoop QA commented on HDFS-11030:
----------------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 51s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 39s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 38s{color} | {color:green} branch-2 passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 44s{color} | {color:green} branch-2 passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 28s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 52s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 16s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 57s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 54s{color} | {color:green} branch-2 passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 34s{color} | {color:green} branch-2 passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 36s{color} | {color:green} the patch passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 40s{color} | {color:green} the patch passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 24s{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 50 unchanged - 7 fixed = 50 total (was 57) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 50s{color} | {color:green} the patch passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 32s{color} | {color:green} the patch passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 48m 20s{color} | {color:green} hadoop-hdfs in the patch passed with JDK v1.7.0_111. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 22s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}134m 54s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_101 Failed junit tests | hadoop.hdfs.TestEncryptionZones |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:b59b8b7 |
| JIRA Issue | HDFS-11030 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12834547/HDFS-11030-branch-2.000.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux 2ebe64861e73 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | branch-2 / 1f384b6 |
| Default Java | 1.7.0_111 |
| Multi-JDK versions |  /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111 |
| findbugs | v3.0.0 |
| JDK v1.7.0_111  Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/17238/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/17238/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
> ---------------------------------------------------------------------
>
>                 Key: HDFS-11030
>                 URL: https://issues.apache.org/jira/browse/HDFS-11030
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, test
>    Affects Versions: 2.7.0
>            Reporter: Mingliang Liu
>            Assignee: Mingliang Liu
>         Attachments: HDFS-11030-branch-2.000.patch, HDFS-11030.000.patch
>
>
> TestDataNodeVolumeFailure#testVolumeFailure fails a volume and verifies that the blocks and
> files are replicated correctly.
> # To fail a volume, it deletes all the blocks and sets the data dir read-only.
> {code:title=testVolumeFailure() snippet}
>     // fail the volume
>     // delete/make non-writable one of the directories (failed volume)
>     data_fail = new File(dataDir, "data3");
>     failedDir = MiniDFSCluster.getFinalizedDir(dataDir, 
>         cluster.getNamesystem().getBlockPoolId());
>     if (failedDir.exists() &&
>         //!FileUtil.fullyDelete(failedDir)
>         !deteteBlocks(failedDir)
>         ) {
>       throw new IOException("Could not delete hdfs directory '" + failedDir + "'");
>     }
>     data_fail.setReadOnly();
>     failedDir.setReadOnly();
> {code}
> However, there are two bugs here that prevent the blocks from being deleted.
> #- The {{failedDir}} directory for finalized blocks is not calculated correctly. It should
> use {{data_fail}} instead of {{dataDir}} as the base directory.
> #- When deleting block files in {{deteteBlocks(failedDir)}}, it assumes that there are
> no subdirectories in the data dir. This assumption was also noted in the comments.
> {quote}
>     // we use only small number of blocks to avoid creating subdirs in the data dir..
> {quote}
> This is not true. On my local cluster and in MiniDFSCluster there are two levels of directories
> (subdir0/subdir0/) regardless of the number of blocks; a sketch of possible fixes follows below.
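> A minimal sketch of both fixes, written inside the existing test class; the recursive helper {{deleteBlockFiles}} is hypothetical and only illustrates descending into the {{subdir}} levels:
> {code:title=possible fix (sketch)}
>     // Compute the finalized dir from the failed storage dir (data_fail),
>     // not from the DataNode root (dataDir).
>     failedDir = MiniDFSCluster.getFinalizedDir(data_fail,
>         cluster.getNamesystem().getBlockPoolId());
>     if (failedDir.exists() && !deleteBlockFiles(failedDir)) {
>       throw new IOException("Could not delete block files in '" + failedDir + "'");
>     }
>
>   /** Hypothetical helper: recursively delete block and meta files under dir. */
>   private static boolean deleteBlockFiles(File dir) {
>     boolean ok = true;
>     File[] entries = dir.listFiles();
>     if (entries == null) {
>       return ok;
>     }
>     for (File f : entries) {
>       if (f.isDirectory()) {
>         ok &= deleteBlockFiles(f);
>       } else if (f.getName().startsWith(Block.BLOCK_FILE_PREFIX)) {
>         // Matches both block files (blk_<id>) and their .meta companions.
>         ok &= f.delete();
>       }
>     }
>     return ok;
>   }
> {code}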
> # Meanwhile, to fail a volume, the test also needs to trigger the DataNode to remove the volume
> and send a block report to the NN. This is basically done in the {{triggerFailure()}} method.
> {code}
>   private void triggerFailure(String path, long size) throws IOException {
>     NamenodeProtocols nn = cluster.getNameNodeRpc();
>     List<LocatedBlock> locatedBlocks =
>       nn.getBlockLocations(path, 0, size).getLocatedBlocks();
>     
>     for (LocatedBlock lb : locatedBlocks) {
>       DatanodeInfo dinfo = lb.getLocations()[1];
>       ExtendedBlock b = lb.getBlock();
>       try {
>         accessBlock(dinfo, lb);
>       } catch (IOException e) {
>         System.out.println("Failure triggered, on block: " + b.getBlockId() +  
>             "; corresponding volume should be removed by now");
>         break;
>       }
>     }
>   }
> {code}
> Accessing those blocks will not trigger failures if the directory is read-only while
> the block files are all still there. I ran the test multiple times without ever triggering this failure.
> To actually trigger one, we would either have to write new block files to the data directories,
> or have deleted the existing blocks correctly. I think we need to add some assertion code after
> triggering the volume failure. The assertions should check the datanode volume failure summary
> explicitly to make sure a volume failure is actually triggered (and noticed); see the sketch below.
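> A minimal sketch of such an assertion, assuming {{data3}} belongs to the second DataNode and using {{DataNodeTestUtils.getFSDataset()}} plus the dataset's {{getNumFailedVolumes()}} metric as one possible way to observe the failure:
> {code:title=assertion sketch after failing the volume}
>     // Block the test until the DataNode itself has registered exactly one
>     // failed volume, so we fail fast if the simulated failure went unnoticed.
>     final DataNode dn = cluster.getDataNodes().get(1);
>     GenericTestUtils.waitFor(new Supplier<Boolean>() {
>       @Override
>       public Boolean get() {
>         return DataNodeTestUtils.getFSDataset(dn).getNumFailedVolumes() == 1;
>       }
>     }, 100, 30000);
> {code}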
> # To make sure the NameNode is aware of the volume failure, the code explicitly sends
> block reports to the NN.
> {code:title=TestDataNodeVolumeFailure#testVolumeFailure()}
>     cluster.getNameNodeRpc().blockReport(dnR, bpid, reports,
>         new BlockReportContext(1, 0, System.nanoTime(), 0, false));
> {code}
> The code that generates the block report is complex; it duplicates the internal logic of {{BPServiceActor}},
> and we may have to update it whenever that logic changes. In fact, the volume failure is now reported
> by the DataNode via heartbeats. We should trigger a heartbeat request here instead, and make sure the
> NameNode has handled the heartbeat before we verify the block states; see the sketch below.
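> A minimal sketch of the heartbeat-based approach, assuming the {{MiniDFSCluster}} and {{BlockManagerTestUtil}} test helpers on branch-2 (whether the explicit {{checkHeartbeat}} call is needed here is an assumption):
> {code:title=heartbeat-based notification (sketch)}
>     // Ask every DataNode to send a heartbeat now; the failed-volume
>     // information travels with the heartbeat instead of a hand-built block report.
>     cluster.triggerHeartbeats();
>     // Force the NameNode's heartbeat monitor to run once so the updated
>     // DataNode state is processed before we verify block states.
>     BlockManagerTestUtil.checkHeartbeat(
>         cluster.getNamesystem().getBlockManager());
> {code}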
> # When verifying via {{verify()}}, the test counts the real block files and asserts that the real
> block files plus the under-replicated blocks cover all blocks. Before counting under-replicated
> blocks, it triggers the {{BlockManager}} to compute the datanode work:
> {code}
>     // force update of all the metric counts by calling computeDatanodeWork
>     BlockManagerTestUtil.getComputedDatanodeWork(fsn.getBlockManager());
> {code}
> However, counting physical block files and counting under-replicated blocks are not atomic. The
> NameNode will inform the DataNode of the computed work at the next heartbeat, so this part of the
> code may fail when some blocks get replicated in between and the count of physical block files
> becomes stale. To avoid this case, I think we should keep the DataNode from sending heartbeats
> after that point. A simple solution is to set {{dfs.heartbeat.interval}} long enough; see the sketch below.
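> A minimal sketch of that configuration, applied before the MiniDFSCluster is built (3600 seconds is just an illustrative value):
> {code:title=suppressing periodic heartbeats (sketch)}
>     Configuration conf = new HdfsConfiguration();
>     // One heartbeat per hour: effectively no periodic heartbeats during the test,
>     // so the NameNode cannot hand out replication work while we count block files.
>     conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 3600);
>     cluster = new MiniDFSCluster.Builder(conf).numDataNodes(2).build();
> {code}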
> This unit test has been there for years and seldom fails, precisely because it has never
> triggered a real volume failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


