hadoop-mapreduce-user mailing list archives

From Chathuri Wimalasena <kamalas...@gmail.com>
Subject How to recover from CORRUPT HDFS state
Date Tue, 27 Dec 2016 19:54:41 GMT

We have a Hadoop cluster with 3 login nodes and 10 data nodes. We are
running Hadoop 2.7.1 with HBase 0.94.23, and both Hadoop and HBase run on
ln02 (logging node 2). Recently we have been facing a serious problem: a
large number of files in HDFS are in a corrupt state. We have been unable to
figure out what caused this mass corruption or how to recover from it. HDFS
holds 40 TB of data, and we are worried that we might have to rebuild the
cluster from scratch because of these errors. The cluster had some file
system issues recently. The events leading up to the corruption are listed
below.

   - Nov 30 - The SSD drives on ln02 died, which triggered a kernel
   panic and a reboot.
   - Dec 20 - The ln02 file system was set to read-only and both hard drives
   on ln02 died. The sys admin removed and reinstalled the SSD drives on
   ln02 and rebooted it, and it came back up. One data node also went down
   the same day due to a disk failure.
   - Dec 21 - The same thing happened as on Dec 20, and ln02 was rebooted.
   The sys admin replaced the failed SSD with another SSD. Another data node
   went down the same day.

On Nov 30 and Dec 20, after the sys admin rebooted the node, I was able to
restart Hadoop and HBase without any issue, and everything worked as
expected. But on Dec 21, when I restarted Hadoop, it automatically switched
to safe mode, and the HDFS fsck command reported a large number of corrupt
and missing files. The fsck output is below.
............................Status: CORRUPT
 Total size:    46454858557036 B (Total open files size: 1340 B)
 Total dirs:    43405
 Total files:   122028
 Total symlinks:                0 (Files currently being written: 10)
 Total blocks (validated):      804832 (avg. block size 57719944 B) (Total open file blocks (not validated): 10)
  UNDER MIN REPL'D BLOCKS:      413578 (51.386875 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        18683
  MISSING BLOCKS:       413578
  MISSING SIZE:         26785603097998 B
  CORRUPT BLOCKS:       413578
 Minimally replicated blocks:   391254 (48.613125 %)
 Over-replicated blocks:        26548 (3.2985766 %)
 Under-replicated blocks:       286 (0.035535365 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     1.4916517
 Corrupt blocks:                413578
 Missing replicas:              572 (0.023681387 %)
 Number of data-nodes:          10
 Number of racks:               1
FSCK ended at Sat Dec 24 13:25:10 EST 2016 in 8378 milliseconds

The filesystem under path '/' is CORRUPT
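
As a quick sanity check on the summary above, the figures can be cross-checked against each other. The sketch below only does arithmetic on numbers copied from the fsck output; the variable names are my own, and it does not talk to the cluster.

```python
# Figures copied from the fsck summary above.
total_size_bytes = 46454858557036
missing_size_bytes = 26785603097998
total_blocks = 804832
missing_blocks = 413578
minimally_replicated = 391254

# Every validated block is either minimally replicated or missing.
accounted = minimally_replicated + missing_blocks

# Share of blocks and of bytes that currently have no live replica.
missing_block_pct = 100.0 * missing_blocks / total_blocks
missing_size_pct = 100.0 * missing_size_bytes / total_size_bytes

# Average block size in whole bytes (integer division).
avg_block_size = total_size_bytes // total_blocks

print(f"blocks accounted for: {accounted} of {total_blocks}")
print(f"missing blocks: {missing_block_pct:.2f} %")
print(f"missing bytes:  {missing_size_pct:.2f} %")
print(f"average block size: {avg_block_size} B")
```

The minimally replicated and missing block counts add up to the total, so the report is internally consistent: roughly half of the blocks, and a little more than half of the bytes, have no replica reported by any data node.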

The HDFS web UI shows the message below.

*Safe mode is ON. The reported blocks 391254 needs additional 412774 blocks
to reach the threshold 0.9990 of total blocks 804832. The number of live
datanodes 10 has reached the minimum number 0. Safe mode will be turned off
automatically once the thresholds have been reached.*
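
The numbers in this message can be reproduced with a small calculation. This is a sketch under two assumptions of mine that happen to match the figures shown: the threshold block count is the product of the total and the threshold truncated to an integer, and the message asks for one block more than the plain difference.

```python
# Figures copied from the safe-mode message above.
total_blocks = 804832
reported_blocks = 391254
threshold = 0.999

# Assumption: the threshold block count is the truncated product.
threshold_blocks = int(total_blocks * threshold)

# Assumption: one block beyond the plain difference is requested.
additional_needed = threshold_blocks - reported_blocks + 1

print(threshold_blocks, additional_needed)
```

Under these assumptions the result matches the 412774 additional blocks quoted in the message, so safe mode will not turn off on its own unless data nodes report about that many more blocks.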

We have also seen some data nodes report Input/output errors
intermittently.

Has anyone experienced a situation like this before? Any ideas on how to
recover from it would be greatly appreciated.
