hbase-user mailing list archives

From "Peter Naudus" <pnau...@dataraker.com>
Subject Single disk failure on single node causes 1 data node, 3 region servers to go down
Date Thu, 21 Jun 2012 15:57:55 GMT
Hello All,

As this problem has both a Hadoop and an HBase component, rather than posting
the same message to both groups, I'm posting the data node portion of this
problem under the title "Single disk failure (with HDFS-457 applied) causes
data node to die" to the Hadoop users group.

In our production cluster we have 9 servers, each with 4 disks. Each server
runs both a data node and a region server.
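
For anyone following along, the knob that HDFS-457 introduces is (as far as I
understand it) dfs.datanode.failed.volumes.tolerated, which defaults to 0, so
a single bad volume takes the whole data node down with it. Below is only a
rough sketch of the hdfs-site.xml setting, assuming 4 data directories per
node as above; the exact behavior may vary with the Hadoop version.

<snip>
<!-- hdfs-site.xml (sketch, not our actual config): let the data node keep
     running with 1 of its 4 dfs.data.dir volumes failed. The default of 0
     means any volume failure shuts the data node down. -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
</snip>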

This morning one of our servers lost one of its disks. This caused a cascading
effect which resulted in that data node and 3 of the 9 region servers going
down. At the time, the system was in the middle of compaction and everything
was running normally.

All snippets below are from the region server log on xxx.xxx.xxx.123.

<snip>
12/06/21 10:26:54 INFO regionserver.HRegion: completed compaction on  
region  
fpl,P.2563737_TS3600_D.1328072400,1330649094970.68469595e2cca48506cd6756c7a9e6c3.  
after 2mins, 42sec
12/06/21 10:26:54 INFO regionserver.HRegion: Starting compaction on region  
fpl,P.69362_TS3600_D.1298955600,1322201140663.fb0ebebfe2257b13d596d74d608a822a.
12/06/21 10:26:54 INFO regionserver.Store: Started compaction of 2 file(s)  
in cf=D into  
hdfs://xxxx:8020/hbase/xxxx/fb0ebebfe2257b13d596d74d608a822a/.tmp,  
seqid=83988687, totalSize=2.0g
12/06/21 10:29:46 INFO regionserver.Store: Completed major compaction of 2
file(s), new
file=hdfs://xxxx:8020/hbase/xxxx/fb0ebebfe2257b13d596d74d608a822a/D/1647142169914040514,
size=2.0g; total size for store is 2.0g
</snip>

Then the disk failed (on xxx.xxx.xxx.129), causing the data node on that
server to die:

<snip>
12/06/21 12:35:23 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception for block blk_-8250761279849076686_833850
java.io.IOException: Bad response 1 for block blk_-8250761279849076686_833850
from datanode xxx.xxx.xxx.129:50010
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2664)
...
12/06/21 12:35:23 INFO hdfs.DFSClient: Error Recovery for block  
blk_-8250761279849076686_833850 waiting for responder to exit.
12/06/21 12:35:23 WARN wal.HLog: IPC Server handler 1 on 60020 took 51641  
ms appending an edit to hlog; editcount=1367, len~=34.2k
12/06/21 12:35:23 WARN hdfs.DFSClient: Error Recovery for block  
blk_-8250761279849076686_833850 bad datanode[1] xxx.xxx.xxx.129:50010
</snip>

Other than xxx.xxx.xxx.129, this data block was also stored on
xxx.xxx.xxx.127 and xxx.xxx.xxx.123 (the data node local to this region
server). However, the region server was unable to read from these data nodes
either. The strange thing is that, although *.129 went down, the other data
nodes remained up.
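
As an aside, a way to double-check which data nodes hold replicas of the file
being written would be an fsck along the lines of the sketch below. This is
only a sketch: <hlog-path> stands in for the WAL file under /hbase/.logs/ that
appears in the "Exception closing file" message further down, and the flags
are the ones I'm used to from Hadoop 1.x-era fsck.

<snip>
# Sketch only, not taken from our cluster: print each block of the given file
# together with the data nodes holding a replica of it.
hadoop fsck <hlog-path> -files -blocks -locations
</snip>

The recovery failures in the region server log then look like this: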

<snip>
12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block  
blk_-8250761279849076686_833850 failed because recovery from primary  
datanode xxx.xxx.xxx.123:50010 failed 2 times. Pipeline was  
xxx.xxx.xxx.123:50010, xxx.xxx.xxx.129:50010, xxx.xxx.xxx.xxx:50010. Will  
retry...
12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block  
blk_-8250761279849076686_833850 bad datanode[1] xxx.xxx.xxx.129:50010
12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block  
blk_-8250761279849076686_833850 in pipeline xxx.xxx.xxx.123:50010,  
xxx.xxx.xxx.129:50010, xxx.xxx.xxx.127:50010: bad datanode  
xxx.xxx.xxx.129:50010
12/06/21 12:36:23 WARN ipc.HBaseServer: IPC Server handler 16 on 60020  
caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at  
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

12/06/21 12:36:23 WARN hdfs.DFSClient: Failed recovery attempt #1 from  
primary datanode xxx.xxx.xxx.123:50010
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block  
blk_-8250761279849076686_833850 is already being recovered, ignoring this  
request to recover it.
at  
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1765)
at  
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1950)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at  
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at  
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at  
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)

at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy10.recoverBlock(Unknown Source)
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2793)
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
</snip>

This was repeated until:

<snip>
12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block
blk_-8250761279849076686_833850 failed because recovery from primary
datanode xxx.xxx.xxx.123:50010 failed 6 times. Pipeline was
xxx.xxx.xxx.123:50010, xxx.xxx.xxx.129:50010, xxx.xxx.xxx.127:50010.
Marking primary datanode as bad.

12/06/21 12:36:23 WARN ipc.HBaseServer: IPC Server Responder, call  
multi(org.apache.hadoop.hbase.client.MultiAction@1492dba1) from  
xxx.xxx.xxx.127:51903: output error

12/06/21 12:36:23 WARN ipc.HBaseServer: IPC Server handler 23 on 60020  
caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at  
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

12/06/21 12:37:23 WARN hdfs.DFSClient: Failed recovery attempt #0 from  
primary datanode xxx.xxx.xxx.127:50010
java.net.SocketTimeoutException: Call to /xxx.xxx.xxx.127:50020 failed on  
socket timeout exception: java.net.SocketTimeoutException: 60000 millis  
timeout while waiting for channel to be ready for read. ch :  
java.nio.channels.SocketChannel[connected local=/xxx.xxx.xxx.123:54996  
remote=/xxx.xxx.xxx.127:50020]
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1138)
at org.apache.hadoop.ipc.Client.call(Client.java:1110)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy10.recoverBlock(Unknown Source)
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2793)
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while  
waiting for channel to be ready for read. ch :  
java.nio.channels.SocketChannel[connected local=/xxx.xxx.xxx.123:54996  
remote=/xxx.xxx.xxx.127:50020]
at  
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at  
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:375)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)
</snip>

This was repeated until xxx.xxx.xxx.123, xxx.xxx.xxx.127, and
xxx.xxx.xxx.129 had all been marked as bad.

After this, I saw problems with (what I'm assuming is) HBase trying to talk
to other region servers:

<snip>
12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server Responder, call  
multi(org.apache.hadoop.hbase.client.MultiAction@54d0259b) from  
xxx.xxx.xxx.127:54700: output error
12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server handler 8 on 60020  
caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at  
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server Responder, call  
multi(org.apache.hadoop.hbase.client.MultiAction@2d550e9d) from  
xxx.xxx.xxx.125:36524: output error
12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server handler 4 on 60020  
caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1389)
at  
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server Responder, call  
multi(org.apache.hadoop.hbase.client.MultiAction@36c3ec1f) from  
xxx.xxx.xxx.128:45630: output error
12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server handler 17 on 60020  
caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1389)
at  
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at  
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
</snip>

The region servers on xxx.xxx.xxx.123 (this region server),
xxx.xxx.xxx.128, and xxx.xxx.xxx.129 all went down. However, it is strange
that there were also I/O errors talking to xxx.xxx.xxx.125 and
xxx.xxx.xxx.127, which didn't go down.

Eventually, the region server died:

<snip>
java.io.InterruptedIOException: Aborting compaction of store D in region  
fpl,P.1596002_TS3600_D.1304226000,1334196560513.4106274c5a8852493fc20d2e50a7e428.  
because user requested stop.
at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:961)
at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:743)
at  
org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:808)
at  
org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:748)
at  
org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)

12/06/21 12:39:24 ERROR hdfs.DFSClient: Exception closing file
/hbase/.logs/xxxxxxpnb003.dataraker.net,60020,1338412070372/xxxxxxpnb003.dataraker.net%3A60020.1340272250127 :
java.io.IOException: Error Recovery for block
blk_-8250761279849076686_833850 failed because recovery from primary
datanode xxx.xxx.xxx.129:50010 failed 6 times. Pipeline was
xxx.xxx.xxx.xxx:50010. Aborting...
java.io.IOException: Error Recovery for block  
blk_-8250761279849076686_833850 failed because recovery from primary  
datanode xxx.xxx.xxx.xxx:50010 failed 6 times. Pipeline was  
xxx.xxx.xxx.129:50010. Aborting...
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2833)
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
at  
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
12/06/21 12:39:24 INFO regionserver.ShutdownHook: Shutdown hook finished.
</snip>

The other two nodes that died also have similar logs.

I've uploaded the log to:  
http://www.linuxlefty.com/hbase-hbase-regionserver-xxxxxpnb003.dataraker.net.out

-- 

Sincerely,

~PeterP

Peter Naudus
DataRaker
Email: pnaudus@dataraker.com

The contents of this email and any attachments are confidential. They are  
intended for the named recipient(s) only.