From: Guy Doulberg
Date: Thu, 15 Dec 2011 11:16:33 +0200
Subject: NameNode - didn't persist the edit log

Hi guys,

We recently had the following problem on our production cluster:

The filesystem containing the editlog and fsimage had no free inodes.
As a result, the namenode wasn't able to obtain an inode for the fsimage and editlog after a checkpoint had been reached, while the previous files were freed.

Unfortunately, we had no monitoring on the number of free inodes, so the namenode ran in this state for a few hours. We noticed the failure on its DFS status page.

But the namenode didn't enter safe mode, so none of the writes made during that time could be persisted to the editlog.

After discovering the problem we freed inodes, and the filesystem seemed to be okay again. We tried to force the namenode to persist to the editlog, with no success. Eventually we restarted the namenode, which of course caused us to lose all the data that was written to HDFS during these few hours (fortunately we have a backup of the recent writes, so we restored the data from there).

This situation raises some severe concerns:

1. How come the namenode identified a failure in persisting its editlog and didn't enter safe mode? (The exception was logged only at WARN severity, not CRITICAL.)

2. How come, after we freed inodes, we couldn't get the namenode to persist its editlog? Maybe there should be a CLI command that enables us to force the namenode to persist its editlog.

Do you know of a JIRA opened for these issues, or should I open one?

Thanks,
Guy
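
For anyone wanting to catch this condition earlier, the missing inode monitoring could look roughly like the sketch below. It is only an illustration, not part of Hadoop: the function names and the 10% / 2% thresholds are assumptions chosen for the example, and the path to check would be whichever directory holds the fsimage and editlog on your namenode. It relies on Python's standard os.statvfs call, which is available on Unix-like systems.

```python
import os

# Hypothetical thresholds (fractions of inodes still free); tune for your site.
WARN_FRACTION = 0.10
CRIT_FRACTION = 0.02


def inode_status(total, free, warn=WARN_FRACTION, crit=CRIT_FRACTION):
    """Classify inode availability as OK, WARNING, CRITICAL or UNKNOWN."""
    if total <= 0:
        # Some filesystems do not report inode counts at all.
        return "UNKNOWN"
    free_fraction = free / total
    if free_fraction < crit:
        return "CRITICAL"
    if free_fraction < warn:
        return "WARNING"
    return "OK"


def check_path(path):
    """Return (status, total_inodes, free_inodes) for the filesystem holding path."""
    st = os.statvfs(path)
    # f_favail counts inodes available to unprivileged users,
    # which is what a namenode running as a normal user would see.
    return inode_status(st.f_files, st.f_favail), st.f_files, st.f_favail


if __name__ == "__main__":
    # "." is a placeholder; point this at the namenode's metadata directory.
    status, total, free = check_path(".")
    print(f"{status}: {free} of {total} inodes free")
```

Run from cron or wired into an existing alerting system, a check like this would have flagged the filesystem before the checkpoint failed, rather than hours afterwards.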