From: Enis Soztutar
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Mon, 12 Nov 2007 17:15:17 +0200
Subject: data loss after power recovery
Message-ID: <47386E05.8020505@gmail.com>

Hi,

After a serious power failure on our cluster running 0.13.0, we were able to restore our previous state. But we have realized that a nontrivial number of blocks are missing. It seems that the namenode requested that all the blocks kept on one specific machine be deleted, which resulted in the deletion of all the replicas. To clarify, for some reason all the blocks on that machine, as well as all the other replicas of those blocks, were deleted by the namenode.

Does anyone know what might have happened? Is this a bug that we should seriously consider fixing, or might it already have been fixed?

The datanode that caused the data loss was 192.168.15.233. It was first started as a slave, then removed to add a new hard disk, and then added back to the cluster.

Below are the relevant logs:

Namenode:

2007-11-11 19:15:11,564 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.15.233:50010
2007-11-11 19:15:12,094 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.15.231:50010 storage DS1698199061
...
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE* SafeModeInfo.leave: Safe mode is OFF.
...
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 1 racks and 36 datanodes
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE* UnderReplicatedBlocks has 56 blocks
2007-11-11 19:30:05,782 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log
2007-11-11 19:30:40,469 INFO org.apache.hadoop.fs.FSNamesystem: Roll FSImage
2007-11-11 19:31:29,913 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.15.236:50010 storage DS1183829041
...
2007-11-11 19:45:03,483 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.15.233:50010
2007-11-11 19:45:03,734 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.15.233:50010
...
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.15.233:50010
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.15.233:50010
...

And example logs for one of the missing blocks, blk_8859727972037265136:

on 192.168.15.203
2007-11-11 19:53:53,755 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_8859727972037265136 file /data/hadoop/dfs/data/current/subdir63/subdir63/subdir63/subdir63/subdir49/blk_8859727972037265136

on 192.168.15.225
2007-11-11 20:18:07,964 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_8859727972037265136 file /data2/hadoop/dfs/data/current/subdir11/subdir63/blk_8859727972037265136

on 192.168.15.233
2007-11-11 19:54:56,078 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_8859727972037265136 file /data/hadoop/dfs/data/current/subdir36/subdir47/blk_8859727972037265136

And the complete log for 192.168.15.233 is:

...
2007-11-11 20:03:37,789 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_3987170016844853189 file /data/hadoop/dfs/data/current/subdir38/blk_3987170016844853189
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_4414509271638104493 file /data/hadoop/dfs/data/current/subdir56/subdir40/blk_4414509271638104493
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_4651660909902273726 file /data/hadoop/dfs/data/current/subdir32/subdir3/blk_4651660909902273726
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5189049009734931732 file /data/hadoop/dfs/data/current/subdir56/subdir42/blk_5189049009734931732
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5395031642694782019 file /data/hadoop/dfs/data/current/subdir41/subdir31/blk_5395031642694782019
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5567722351418795177 file /data/hadoop/dfs/data/current/subdir56/subdir42/blk_5567722351418795177
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5592463115430469494 file /data/hadoop/dfs/data/current/subdir10/subdir48/blk_5592463115430469494
... (for all blocks in the datanode)
2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9219752334498294080. Block not found in blockMap.
2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9217018193785551154. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9211664991594450527. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9211471391608631351. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9208445774532268187. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9202539319669633125. Block not found in blockMap.
...

Thanks in advance.
Enis Soztutar
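
For anyone who wants to check their own datanode logs for the same pattern, the cross-referencing described above (finding blocks whose deletion was logged on every node that held a replica) could be sketched roughly as follows. This is only an illustration and not part of Hadoop: the function names, the `sample_logs` structure, and the default of 3 replicas are assumptions, and the regular expression is based solely on the "Deleting block" line format quoted in this message.

```python
import re
from collections import defaultdict

# Matches datanode lines in the format quoted in this message, e.g.
# "2007-11-11 20:03:37,789 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_... file /path"
DELETE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.dfs\.DataNode: Deleting block "
    r"(?P<block>blk_-?\d+) file (?P<path>\S+)"
)


def deletions_by_block(logs):
    """Map each block ID to {datanode: timestamp of its logged deletion}.

    `logs` is a hypothetical structure: datanode address -> iterable of log lines.
    """
    result = defaultdict(dict)
    for node, lines in logs.items():
        for line in lines:
            match = DELETE_RE.match(line.strip())
            if match:
                result[match.group("block")][node] = match.group("ts")
    return dict(result)


def fully_deleted(logs, replication=3):
    """Return blocks whose deletion was logged on at least `replication` datanodes.

    replication=3 is an assumption; use the replication factor of your cluster.
    """
    per_block = deletions_by_block(logs)
    return sorted(b for b, nodes in per_block.items() if len(nodes) >= replication)


# The three example lines quoted above for blk_8859727972037265136.
sample_logs = {
    "192.168.15.203": [
        "2007-11-11 19:53:53,755 INFO org.apache.hadoop.dfs.DataNode: Deleting block "
        "blk_8859727972037265136 file /data/hadoop/dfs/data/current/subdir63/subdir63/"
        "subdir63/subdir63/subdir49/blk_8859727972037265136",
    ],
    "192.168.15.225": [
        "2007-11-11 20:18:07,964 INFO org.apache.hadoop.dfs.DataNode: Deleting block "
        "blk_8859727972037265136 file /data2/hadoop/dfs/data/current/subdir11/subdir63/"
        "blk_8859727972037265136",
    ],
    "192.168.15.233": [
        "2007-11-11 19:54:56,078 INFO org.apache.hadoop.dfs.DataNode: Deleting block "
        "blk_8859727972037265136 file /data/hadoop/dfs/data/current/subdir36/subdir47/"
        "blk_8859727972037265136",
    ],
}

if __name__ == "__main__":
    for block in fully_deleted(sample_logs):
        print(block, deletions_by_block(sample_logs)[block])
```

In practice the dictionary would be filled by reading each datanode's log file. The WARN lines ("Block not found in blockMap") are deliberately ignored, since they record deletions that did not happen on that node.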