hadoop-hdfs-user mailing list archives

From Oussama Jilal <jilal.ouss...@gmail.com>
Subject Re: File Checksum Error
Date Thu, 28 Jul 2016 11:07:39 GMT
In case anyone faces the same issue... I managed to fix it by
running:

hadoop fs -get -ignoreCrc
hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
datanode-2,16020,1466263181091.default.1469654596681

then:

hadoop fs -rm
/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681

and then:

hadoop fs -put datanode-2,16020,1466263181091.default.1469654596681
/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681

And finally starting HBase again... which basically means getting the
problematic file from HDFS (ignoring the checksum) and then putting it back!
Phew! (It might be risky... but it worked...)
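
For anyone wanting to do the same, here is a rough sketch of the procedure
with placeholder paths (<server> and <wal-file> below are not the real names,
substitute the path of your own corrupt file, and keep the local copy around
until you are sure everything is fine):

# <server> and <wal-file> are placeholders for the corrupt WAL's path in HDFS
WAL="/hbase/WALs/<server>-splitting/<wal-file>"

hadoop fs -get -ignoreCrc "hdfs://namenode$WAL" ./wal-local-copy   # download, skipping the CRC check
hadoop fs -rm "$WAL"                                               # remove the corrupt copy from HDFS
hadoop fs -put ./wal-local-copy "$WAL"                             # write it back so fresh checksums are computed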

Best Regards.

On Thu, Jul 28, 2016 at 11:26 AM Oussama Jilal <jilal.oussama@gmail.com>
wrote:

>
> Hello,
>
> This is originally an HBase issue, but please allow me to forward it to
> the hadoop mailing list since the error is in the checksum of a file in
> HDFS...
>
> We are facing a serious issue with our production system. We are on a
> Windows Azure infrastructure and yesterday, unexpectedly, all of our VMs
> restarted... this caused our HBase cluster (1 Master + 2 Region
> Servers) to crash...
>
> Now whenever I try to start our HBase cluster, it goes down again. I checked
> the HBase logs and this is what I found in the master logs:
>
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 2016-07-28 09:33:10,052 WARN  [main-EventThread]
> coordination.SplitLogManagerCoordination: Error splitting
> /hbase/splitWAL/WALs%2Fdatanode-2%2C16020%2C1466263181091-splitting%2Fdatanode-2%252C16020%252C1466263181
> 091.default.1469654596681
> 2016-07-28 09:33:10,052 WARN  [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
> master.SplitLogManager: error while splitting logs in
> [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting]
> installed = 1
> but only 0 done
> 2016-07-28 09:33:10,053 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
> executor.EventHandler: Caught throwable while processing event
> M_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for
> datanode-2,16020,1466263181091, will retry
>         at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
>         at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: error or interrupted while splitting logs
> in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting]
> Task = installed = 1 done = 0 error = 1
>         at
> org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
>         at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
>         at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
>         at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
>         at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
>         ... 4 more
> 2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
> master.HMaster: Master server abort: loaded coprocessors are: []
> 2016-07-28 09:33:10,055 FATAL [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
> master.HMaster: Caught throwable while processing event M_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for
> datanode-2,16020,1466263181091, will retry
>         at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
>         at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: error or interrupted while splitting logs
> in [hdfs://namenode/hbase/WALs/datanode-2,16020,1466263181091-splitting]
> Task = installed = 1 done = 0 error = 1
>         at
> org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
>         at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
>         at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
>         at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
>         at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
>         ... 4 more
> 2016-07-28 09:33:10,055 INFO  [MASTER_SERVER_OPERATIONS-NameNode:16000-3]
> regionserver.HRegionServer: STOPPED: Caught throwable while processing
> event M_SERVER_SHUTDOWN
> 2016-07-28 09:33:10,055 ERROR [MASTER_SERVER_OPERATIONS-NameNode:16000-2]
> executor.EventHandler: Caught throwable while processing event
> M_SERVER_SHUTDOWN
> java.io.IOException: Server is stopped
>         at
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:194)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> And this is one of the Region Server logs:
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 2016-07-28 09:33:10,033 WARN  [RS_LOG_REPLAY_OPS-datanode-2:16020-0]
> regionserver.SplitLogWorker: log splitting of
> WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
> failed, returning error
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121
> file=/hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
>         at
> org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:882)
>         at
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:563)
>         at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:793)
>         at
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:298)
>         at
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:267)
>         at
> org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:839)
>         at
> org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:763)
>         at
> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
>         at
> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
>         at
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
>         at
> org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Now the file "
> */hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681*"
> does exist on the DFS, and I have run "*hdfs fsck /*", which says "*The
> filesystem under path '/' is HEALTHY*". But when I try accessing the file,
> bad things happen:
>
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
> DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
> at 5120
> 16/07/28 09:57:30 WARN hdfs.DFSClient: Found Checksum error for
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
> at 5120
> 16/07/28 09:57:30 INFO hdfs.DFSClient: Could not obtain
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node:
> java.io.IOException: No live nodes contain block
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking
> nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK],
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]],
> ignoredNodes = null No live nodes contain current block Block locations:
> DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
> Dead nodes:  DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK].
> Will get new block locations from namenode and retry...
> 16/07/28 09:57:30 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1
> IOException, will wait for 2671.5631767728437 msec.
> 16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
> DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
> at 5120
> 16/07/28 09:57:33 WARN hdfs.DFSClient: Found Checksum error for
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
> at 5120
> 16/07/28 09:57:33 INFO hdfs.DFSClient: Could not obtain
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node:
> java.io.IOException: No live nodes contain block
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking
> nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK],
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]],
> ignoredNodes = null No live nodes contain current block Block locations:
> DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
> Dead nodes:  DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK].
> Will get new block locations from namenode and retry...
> 16/07/28 09:57:33 WARN hdfs.DFSClient: DFS chooseDataNode: got # 2
> IOException, will wait for 8977.03196466135 msec.
> 16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
> DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
> at 5120
> 16/07/28 09:57:42 WARN hdfs.DFSClient: Found Checksum error for
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
> at 5120
> 16/07/28 09:57:42 INFO hdfs.DFSClient: Could not obtain
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 from any node:
> java.io.IOException: No live nodes contain block
> BP-304127416-10.0.0.7-1465905487911:blk_1073770935_30121 after checking
> nodes = [DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK],
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]],
> ignoredNodes = null No live nodes contain current block Block locations:
> DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK]
> Dead nodes:  DatanodeInfoWithStorage[10.0.0.9:50010,DS-be379543-51da-481b-ad25-2259d8fbe479,DISK]
> DatanodeInfoWithStorage[10.0.0.8:50010,DS-b5d3244e-3ccb-4c5f-831d-4c300e59ee74,DISK].
> Will get new block locations from namenode and retry...
> 16/07/28 09:57:42 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3
> IOException, will wait for 13504.266287461855 msec.
> cat: Checksum error:
> /hbase/WALs/datanode-2,16020,1466263181091-splitting/datanode-2%2C16020%2C1466263181091.default.1469654596681
> at 5120 exp: 123018583 got: 48747596
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Any help is appreciated in order to recover our HBase cluster...
>
