hadoop-hdfs-issues mailing list archives

From "Walter Su (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5280) Corrupted meta files on data nodes prevents DFClient from connecting to data nodes and updating corruption status to name node.
Date Wed, 27 Apr 2016 03:46:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259490#comment-15259490 ]

Walter Su commented on HDFS-5280:

There are other IOExceptions that cause the readBlock RPC call to fail and, in turn, cause the dn to be marked as dead. We could fix those as well.
If I understand correctly, your approach is to use a fake checksum: when the client reads the data, the checksum verification fails, and the client marks the block as corrupted instead of marking the dn as dead. I wonder, can we keep the client from reading from this dn in the first place? If the client fails to create the blockreader, it can tell whether the dn is dead or just the block is corrupted.

    try {
      blockReader = getBlockReader(targetBlock, offsetIntoBlock,
          targetBlock.getBlockSize() - offsetIntoBlock, targetAddr,
          storageType, chosenNode);
      if (connectFailedOnce) {
        DFSClient.LOG.info("Successfully connected to " + targetAddr +
                           " for " + targetBlock.getBlock());
      }
      return chosenNode;
    } catch (IOException ex) {
      if (ex instanceof InvalidEncryptionKeyException && refetchEncryptionKey > 0) {
        ...
      } else {
        ...
        addToDeadNodes(chosenNode);
      }
    }
Instead of falling through to the {{else}} clause, can we have another exception like {{InvalidEncryptionKeyException}}? If we catch it, we skip the dn and do not add it to the dead nodes.
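A minimal sketch of that idea, outside the real DFSInputStream (the class {{CorruptedBlockException}}, the {{ReplicaSelector}} wrapper, and the {{outcome}} map simulating getBlockReader failures are all hypothetical, not actual HDFS code):

```java
import java.io.IOException;
import java.util.*;

// Hypothetical dedicated exception: "this replica's block/meta file is
// corrupted" as opposed to "this datanode is unreachable".
class CorruptedBlockException extends IOException {
    CorruptedBlockException(String msg) { super(msg); }
}

class ReplicaSelector {
    final Set<String> deadNodes = new HashSet<>();
    final Set<String> skippedNodes = new HashSet<>();

    // The outcome map stands in for getBlockReader(): null means the
    // block reader was created successfully, otherwise the exception
    // that the attempt would have thrown.
    String chooseNode(List<String> nodes, Map<String, IOException> outcome) {
        for (String dn : nodes) {
            if (deadNodes.contains(dn) || skippedNodes.contains(dn)) continue;
            try {
                IOException ex = outcome.get(dn);
                if (ex != null) throw ex;
                return dn;                   // connected; read from this dn
            } catch (CorruptedBlockException e) {
                skippedNodes.add(dn);        // skip replica; dn stays alive
            } catch (IOException e) {
                deadNodes.add(dn);           // genuine connection failure
            }
        }
        return null;                         // no usable replica
    }
}
```

With this split, a dn holding a corrupted meta file is merely skipped for this block, while a dn that fails to connect at all is still added to the dead-node list as today.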

> Corrupted meta files on data nodes prevents DFClient from connecting to data nodes and updating corruption status to name node.
> -------------------------------------------------------------------------------------------------------------------------------
>                 Key: HDFS-5280
>                 URL: https://issues.apache.org/jira/browse/HDFS-5280
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 1.1.1, 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.7.2
>         Environment: Red hat enterprise 6.4
> Hadoop-2.1.0
>            Reporter: Jinghui Wang
>            Assignee: Andres Perez
>         Attachments: HDFS-5280.patch
> Corrupted meta files make the DFSClient unable to connect to the datanodes to access the blocks, so the DFSClient never performs a read on the block. That read is what throws the ChecksumException when file blocks are corrupted and reports to the namenode to mark the block as corrupt. Since the client never gets that far, the file status remains healthy, and so do all the blocks.
> To replicate the error, put a file onto HDFS.
> Running hadoop fsck /tmp/bogus.csv -files -blocks -locations gives the following output:
> FSCK started for path /tmp/bogus.csv at 11:33:29
> /tmp/bogus.csv 109 bytes, 1 block(s):  OK
> 0. blk_-4255166695856420554_5292 len=109 repl=3
> Find the block/meta files for 4255166695856420554 by running
> ssh datanode1.address find /hadoop/ -name "*4255166695856420554*", which gives the following output:
> /hadoop/data1/hdfs/current/subdir2/blk_-4255166695856420554
> /hadoop/data1/hdfs/current/subdir2/blk_-4255166695856420554_5292.meta
> Now corrupt the meta file by running
> ssh datanode1.address "sed -i -e '1i 1234567891' /hadoop/data1/hdfs/current/subdir2/blk_-4255166695856420554_5292.meta"

> Now run hadoop fs -cat /tmp/bogus.csv.
> It will show the stack trace of the DFSClient failing to connect to the data node with the corrupted meta file.

This message was sent by Atlassian JIRA
