Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Received-SPF: unknown (athena.apache.org: error in processing during lookup of
 Guy.Doulberg@conduit.com)
Message-ID: <4EE9BAF1.7040801@conduit.com>
Date: Thu, 15 Dec 2011 11:16:33 +0200
From: Guy Doulberg <guy.doulberg@conduit.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:8.0) Gecko/20111124 Thunderbird/8.0
MIME-Version: 1.0
To: <common-user@hadoop.apache.org>
Subject: NameNode - didn't persist the edit log
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit

Hi guys,

We recently had the following problem on our production cluster:

The filesystem holding the editlog and fsimage ran out of free inodes.
As a result, after a checkpoint was reached and the previous files were
freed, the namenode could not obtain inodes for the new fsimage and
editlog.
Unfortunately, we had no monitoring of inode usage, so the namenode ran
in this state for a few hours.
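
For anyone who wants to add the kind of check we were missing, here is a
minimal sketch of an inode monitor. It is not what we run in production;
the function names and the 5% threshold are assumptions chosen for
illustration, and the path you pass in would be whatever directory holds
your fsimage and editlog.

```python
import os


def inode_usage(path):
    """Return (free, total) inode counts for the filesystem holding path."""
    st = os.statvfs(path)
    # f_favail: inodes available to unprivileged users; f_files: total inodes.
    return st.f_favail, st.f_files


def inodes_low(free, total, min_free_ratio=0.05):
    """Return True when the share of free inodes is below min_free_ratio.

    The 5% default is an arbitrary, hypothetical threshold.
    """
    if total == 0:
        # Some filesystems report 0 total inodes; there is nothing to check.
        return False
    return free / total < min_free_ratio


def check(path, min_free_ratio=0.05):
    """Return a one-line status message suitable for a cron job or alert."""
    free, total = inode_usage(path)
    state = "WARNING" if inodes_low(free, total, min_free_ratio) else "OK"
    return "%s: %s has %d of %d inodes free" % (state, path, free, total)
```

Calling check() on the namenode's metadata directory from cron and
alerting on any message that starts with "WARNING" would have caught our
problem hours earlier.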

We noticed the failure on the namenode's DFS status page.

However, the namenode did not enter safe mode, so none of the writes
made during that time could be persisted to the editlog.


After discovering the problem we freed inodes, and the filesystem
seemed to be okay again. We then tried to force the namenode to persist
to the editlog, with no success.

Eventually we restarted the namenode, which of course caused us to lose
all the data written to HDFS during those few hours. (Fortunately, we
had a backup of the recent writes, so we restored the data from there.)

This situation raises some serious concerns:
1. How come the namenode detected a failure to persist its editlog and
did not enter safe mode? (The exception was logged only at WARN
severity, not as CRITICAL.)
2. How come we couldn't get the namenode to persist its editlog even
after we freed inodes? Maybe there should be a CLI command that lets us
force the namenode to persist its editlog.

Do you know of a JIRA already opened for these issues, or should I open
one?

Thanks,
Guy