Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id B455F200D18 for ; Wed, 11 Oct 2017 16:27:05 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id B2EC31609CA; Wed, 11 Oct 2017 14:27:05 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 069E8160BE3 for ; Wed, 11 Oct 2017 16:27:04 +0200 (CEST) Received: (qmail 76599 invoked by uid 500); 11 Oct 2017 14:27:04 -0000 Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list hdfs-issues@hadoop.apache.org Received: (qmail 76363 invoked by uid 99); 11 Oct 2017 14:27:04 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 11 Oct 2017 14:27:04 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 2EEA919619A for ; Wed, 11 Oct 2017 14:27:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id mLIDhQOmPWar for ; Wed, 11 Oct 2017 14:27:02 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id F019A60DF2 for ; Wed, 11 Oct 2017 14:27:01 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 1FE31E0F4E for ; Wed, 11 Oct 2017 14:27:01 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 7B804253A1 for ; Wed, 11 Oct 2017 14:27:00 +0000 (UTC) Date: Wed, 11 Oct 2017 14:27:00 +0000 (UTC) From: "Kihwal Lee (JIRA)" To: hdfs-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 11 Oct 2017 14:27:05 -0000 [ https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200361#comment-16200361 ] Kihwal Lee commented on HDFS-12638: ----------------------------------- We have also seen blocks with "null" bc staying in the replication queue. They were missing blocks so the replication monitor didn't even try to schedule them and didn't crash. But metaSave was listing them as orphaned (bc == null, deleted). Other than failing over (force queue reinitialization),there was no way to clear them. In your particular case, we can add a null check in {{scheduleReplication()}} in addition to the existing deletion check. The missing block case is a bit trickier, since the replication monitor will not touch them and nothing will move it to a different priority level since the block is already deleted and invalidated on datanodes. We should prevent it from getting added to the queue. In any case, it is apparent that the new {{isDeleted()}} check cannot replace the bc null check 100%. [~jingzhao] any thoughts? > NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets > ----------------------------------------------------------------------------------------------------------- > > Key: HDFS-12638 > URL: https://issues.apache.org/jira/browse/HDFS-12638 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs > Affects Versions: 2.8.2 > Reporter: Jiandan Yang > > Active NamNode exit due to NPE, I can confirm that the BlockCollection passed in when creating ReplicationWork is null, but I do not know why BlockCollection is null, By view history I found [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging whether BlockCollection is null. > NN logs are as following: > {code:java} > 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception. > java.lang.NullPointerException > at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55) > at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532) > at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491) > at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792) > at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744) > at java.lang.Thread.run(Thread.java:834) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org