Date: Wed, 15 Oct 2014 09:52:34 +0000 (UTC)
From: "Carrey Zhan (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Created] (HDFS-7253) getBlockLocationsUpdateTimes missing exception handling may cause fsLock deadlock

Carrey Zhan created HDFS-7253:
---------------------------------

             Summary: getBlockLocationsUpdateTimes missing exception handling may cause fsLock deadlock
                 Key: HDFS-7253
                 URL: https://issues.apache.org/jira/browse/HDFS-7253
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.2.0
            Reporter: Carrey Zhan

One day my active NameNode hung, and I dumped the thread stacks with jstack. In the stack dump, most threads were waiting on FSNamesystem.fsLock: neither the read lock nor the write lock could be acquired, yet no thread was holding the write lock.

I tried to reach the NameNode's web interface, but it was blocked as well, and a manual failover to the other NameNode (the ZKFC had not detected that this node was hanging) also failed. So I killed the NameNode to recover the production environment; failover was then triggered, the standby NameNode transitioned to active, and then the new active NameNode hung as well. My subsequent steps were ineffective and can be ignored.

In the end I concluded that the cause is incorrect lock handling in FSNamesystem.getBlockLocationsUpdateTimes, which I will describe in the first comment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
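For illustration, below is a minimal standalone Java sketch of the kind of lock leak that would produce this picture. It is hypothetical demo code, not the actual FSNamesystem source (the concrete code path is in the first comment), and it assumes fsLock behaves like a fair ReentrantReadWriteLock with some call that can throw after the read lock is taken but before the try/finally that releases it. A leaked read lock blocks the next writeLock() forever, and with a fair lock every later readLock() queues behind that writer, so the service appears completely stuck. jstack then shows no write-lock owner, because read holds of a ReentrantReadWriteLock are not attributed to any thread, which matches the stack dump described above.

import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only -- names and structure do not come from HDFS source.
public class FsLockLeakSketch {
  // Assumption: a fair read-write lock, analogous to FSNamesystem's fsLock.
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

  // Stand-in for any precondition check that may throw (hypothetical).
  private void checkSomething() throws IOException {
    throw new IOException("simulated failure");
  }

  // Buggy shape: a throwing call sits between lock() and the try/finally,
  // so the read lock is never released when the exception fires.
  public void buggyRead() throws IOException {
    fsLock.readLock().lock();
    checkSomething();              // throws -> read lock is leaked
    try {
      // ... read the namespace ...
    } finally {
      fsLock.readLock().unlock();  // never reached on the exception path
    }
  }

  // Safe shape: everything that can throw happens inside the try block.
  public void safeRead() throws IOException {
    fsLock.readLock().lock();
    try {
      checkSomething();
      // ... read the namespace ...
    } finally {
      fsLock.readLock().unlock();
    }
  }
}

Once buggyRead() has leaked its hold, any thread calling fsLock.writeLock().lock() parks indefinitely and later readers park behind it; if the same trigger fires on the newly active NameNode, it hangs in the same way after failover.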