Date: Wed, 26 Apr 2017 22:39:04 +0000 (UTC)
From: "Wei-Chiu Chuang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Comment Edited] (HDFS-10788) fsck NullPointerException when it encounters corrupt replicas

[ https://issues.apache.org/jira/browse/HDFS-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985669#comment-15985669 ]

Wei-Chiu Chuang edited comment on HDFS-10788 at 4/26/17 10:38 PM:
------------------------------------------------------------------

Sorry for the confusion. CDH 5.4 and above are based on Apache Hadoop 2.6, but we selectively backport fixes and features from 2.7, 2.8, and even trunk. This is sometimes confusing, so we have internal tools to map Apache Hadoop JIRA keys to CDH releases. :)

What I want to say is that, because of this logistical reason, it is entirely possible that a CDH backport missed something else and introduced a bug that is exclusive to CDH releases.

> fsck NullPointerException when it encounters corrupt replicas
> -------------------------------------------------------------
>
>                 Key: HDFS-10788
>                 URL: https://issues.apache.org/jira/browse/HDFS-10788
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: CDH5.5.2, CentOS 6.7
>            Reporter: Jeff Field
>
> Somehow (I haven't found the root cause yet) we ended up with blocks that have corrupt replicas where the replica count is inconsistent between the block map and the corrupt replicas map. If we try to hdfs fsck any parent directory that has a child with one of these blocks, fsck exits with something like this:
> {code}
> $ hdfs fsck /path/to/parent/dir/ | egrep -v '^\.+$'
> Connecting to namenode via http://mynamenode:50070
> FSCK started by bot-hadoop (auth:KERBEROS_SSL) from /10.97.132.43 for path /path/to/parent/dir/ at Tue Aug 23 20:34:58 UTC 2016
> .........................................................................FSCK ended at Tue Aug 23 20:34:59 UTC 2016 in 1098 milliseconds
> null
> Fsck on path '/path/to/parent/dir/' FAILED
> {code}
> So I start at the top, fscking every subdirectory until I find one or more that fails. Then I do the same thing with those directories (our top-level directories all have subdirectories with date directories in them, which then contain the files), and once I find a directory with files in it, I run a checksum of the files in that directory. When I do that, I don't get the name of the file; instead I get:
> checksum: java.lang.NullPointerException
> but since the files are in order, I can figure it out by seeing which file came before the NPE.
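The directory-by-directory bisection described above can be sketched in a testable form. This is only an illustration of the search logic, not actual fsck code: the real `hdfs fsck` and `hdfs dfs -ls` calls are replaced by injectable functions, and all paths and helper names below are hypothetical.

```python
# Sketch of the fsck bisection: descend into the directory tree, keeping
# only subtrees whose fsck fails, until reaching a directory none of whose
# children fails individually -- the bad block lives in that directory's files.

def find_suspect_dirs(root, fsck_ok, list_subdirs):
    """Return the deepest directories under `root` whose fsck fails
    while no subdirectory of theirs fails on its own."""
    suspects = []

    def walk(d):
        if fsck_ok(d):
            return  # this subtree is clean; prune it
        failing_children = [c for c in list_subdirs(d) if not fsck_ok(c)]
        if not failing_children:
            suspects.append(d)  # failure originates in files directly here
        for c in failing_children:
            walk(c)

    walk(root)
    return suspects

# Example against a mock namespace where one date directory holds the bad block:
tree = {
    "/data": ["/data/2016-08-22", "/data/2016-08-23"],
    "/data/2016-08-22": [],
    "/data/2016-08-23": [],
}
bad = {"/data", "/data/2016-08-23"}  # paths for which fsck would fail

print(find_suspect_dirs("/data",
                        fsck_ok=lambda d: d not in bad,
                        list_subdirs=lambda d: tree.get(d, [])))
# → ['/data/2016-08-23']
```

Against a real cluster the two lambdas would shell out to `hdfs fsck <dir>` and `hdfs dfs -ls <dir>`; the point is only that the tedious manual search is mechanical.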
> Once I get to this point, I can see the following in the namenode log when I try to checksum the corrupt file:
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1335893388_1100036319546 blockMap has 0 but corrupt replicas map has 1
> 2016-08-23 20:24:59,627 WARN org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 192.168.1.100:47785 Call#1 Retry#0
> java.lang.NullPointerException
> At that point I can delete the file, but it is a very tedious process.
> Ideally, shouldn't fsck be able to emit the name of the file that is the source of the problem, and (if -delete is specified) get rid of the file, instead of exiting without saying why?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)