hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-10788) fsck NullPointerException when it encounters corrupt replicas
Date Wed, 24 Aug 2016 21:23:21 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435658#comment-15435658 ]

Wei-Chiu Chuang edited comment on HDFS-10788 at 8/24/16 9:23 PM:

Thanks [~kshukla] for confirming my guess. I also traced the code and found that {{ClientProtocol.getBlockLocations}}
indirectly calls {{BlockManager#createLocatedBlocks}}. CDH5.5.2 went GA early this year, before
HDFS-9985 was committed, so it does not have the HDFS-9985 fix.
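To make the failure mode above concrete, here is a purely illustrative sketch (editor's simplification, not actual Hadoop source; class and method names are hypothetical) of how a disagreement between the block map's corrupt-replica count and the corrupt replicas map can surface as a NullPointerException when a caller assumes the two are consistent:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplification of the inconsistency described in this issue;
// names are illustrative, not actual Hadoop source.
public class CorruptReplicaMismatch {
    // corrupt replicas map: blockId -> corrupt replica count (null if none recorded)
    static Map<Long, Integer> corruptReplicasMap = new HashMap<>();
    // block map's view: blockId -> corrupt replica count it believes exists
    static Map<Long, Integer> blockMap = new HashMap<>();

    // Mimics a pre-hardening code path: assumes the corrupt replicas map has
    // an entry whenever it is consulted for a block.
    static int corruptReplicasFor(long blockId) {
        Integer corrupt = corruptReplicasMap.get(blockId);
        // If the two maps disagree (blockMap has 0 but this map has no entry,
        // or vice versa), unboxing the null Integer throws NullPointerException.
        return corrupt;
    }

    public static void main(String[] args) {
        long blk = 1335893388L;
        blockMap.put(blk, 0);              // blockMap: 0 corrupt replicas
        corruptReplicasMap.put(blk + 1, 1); // corrupt map: 1 entry, different key
        try {
            corruptReplicasFor(blk);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException, as fsck reports");
        }
    }
}
```

Under this (assumed) model, HDFS-9985-style hardening amounts to treating a missing entry as zero instead of unboxing it blindly.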

was (Author: jojochuang):
Thanks [~kshukla] for confirming my guess. I also traced the code and found {{ClientProtocol.getBlockLocations}}
indirectly calls {{BlockManager#createLocatedBlocks}}. CDH5.5.2 is GA before Hadoop 2.7.3
so it does not have the fix HDFS-9985.

> fsck NullPointerException when it encounters corrupt replicas
> -------------------------------------------------------------
>                 Key: HDFS-10788
>                 URL: https://issues.apache.org/jira/browse/HDFS-10788
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: CDH5.5.2, CentOS 6.7
>            Reporter: Jeff Field
> Somehow (I haven't found the root cause yet) we ended up with blocks that have corrupt
> replicas where the replica count is inconsistent between the blockmap and the corrupt
> replicas map. If we try to hdfs fsck any parent directory that has a child with one of
> these blocks, fsck will exit with something like this:
> {code}
> $ hdfs fsck /path/to/parent/dir/ | egrep -v '^\.+$'
> Connecting to namenode via http://mynamenode:50070
> FSCK started by bot-hadoop (auth:KERBEROS_SSL) from / for path /path/to/parent/dir/ at Tue Aug 23 20:34:58 UTC 2016
> .........................................................................FSCK ended at Tue Aug 23 20:34:59 UTC 2016 in 1098 milliseconds
> null
> Fsck on path '/path/to/parent/dir/' FAILED
> {code}
> So I start at the top, fscking every subdirectory until I find one or more that fails.
> Then I do the same thing with those directories (our top-level directories all have
> subdirectories with date directories in them, which then contain the files). Once I find
> a directory with files in it, I run a checksum of the files in that directory. When I do
> that, I don't get the name of the file; instead I get:
> checksum: java.lang.NullPointerException
> but since the files are in order, I can figure it out by seeing which file came before
> the NPE. Once I get to this point, I can see the following in the namenode log when I
> try to checksum the corrupt file:
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1335893388_1100036319546 blockMap has 0 but corrupt replicas map has 1
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from Call#1 Retry#0
> java.lang.NullPointerException
> At which point I can delete the file, but it is a very tedious process.
> Ideally, shouldn't fsck be able to emit the name of the file that is the source of the
> problem - and (if -delete is specified) get rid of the file, instead of exiting without saying
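The behavior the reporter asks for could look roughly like the following. This is an editor's illustrative sketch, not Hadoop source; the interface and method names are hypothetical. The idea is simply to catch the per-file failure, name the path, and continue instead of aborting the whole run:

```java
import java.util.Arrays;
import java.util.List;

// Editor's illustrative sketch (hypothetical names, not Hadoop source):
// report the offending path and keep scanning instead of letting one
// NullPointerException abort the entire fsck run.
public class TolerantFsck {
    interface BlockChecker { void check(String path); } // hypothetical per-file check

    static int scan(List<String> paths, BlockChecker checker) {
        int failures = 0;
        for (String path : paths) {
            try {
                checker.check(path);
            } catch (NullPointerException e) {
                // Name the file rather than failing the whole run with "null".
                System.out.println("CORRUPT (inconsistent replica maps): " + path);
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Simulate one healthy file and one whose block lookup throws an NPE.
        BlockChecker checker = path -> {
            if (path.contains("bad")) throw new NullPointerException();
        };
        int failed = scan(Arrays.asList("/data/good.txt", "/data/bad.txt"), checker);
        System.out.println(failed + " corrupt file(s) named; scan completed");
    }
}
```

With the offending path surfaced, an optional -delete-style flag could then remove just that file, replacing the top-down manual search described above.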

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
