hbase-user mailing list archives

From "Peter Naudus" <pnau...@dataraker.com>
Subject Re: Is it normal that a dead data node causes 3 region servers to go down?
Date Mon, 25 Jun 2012 12:52:30 GMT
Hello All,

With the help of the Hadoop community we've identified why the data node  
died. However, I'm still trying to resolve my HBase issues.

Is it normal that when a data node goes down it brings down 3 region  
servers with it? I was under the impression that the HBase region servers  
had some kind of failover mechanism that would prevent this. Since there  
are multiple copies of the data stored, it doesn't make sense that the  
inaccessibility of one copy causes all other copies to also become  
inaccessible.

~ Peter

On Thu, 21 Jun 2012 11:57:55 -0400, Peter Naudus <pnaudus@dataraker.com>  
wrote:

> Hello All,
>
> As this problem has both a Hadoop and HBase component, rather than  
> posting the same message to both groups, I'm posting the datanode  
> portion of this problem under the title of "Single disk failure (with  
> HDFS-457 applied) causes data node to die" to the Hadoop users group.
>
> In our production cluster we have 9 servers: each with 4 disks. Each  
> server runs both a data node and region server.
>
> This morning one of our servers lost one of its disks. This caused a  
> cascading effect which resulted in the data node and 3 of 9 region  
> servers going down. The system was in the middle of compaction and  
> everything had been running normally.
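>
> As a sanity check on the HDFS side, I've been dumping the effective  
> settings that matter for this failure mode. This is only a sketch: the  
> hdfs-site.xml path is a guess for our layout, and  
> dfs.datanode.failed.volumes.tolerated is only honoured by builds that  
> carry the multi-volume tolerance patches:
>
> <snip>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
>
> public class DumpHdfsSettings {
>     public static void main(String[] args) {
>         Configuration conf = new Configuration();
>         // Assumed location of the cluster config; adjust for your install.
>         conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
>
>         // Replication factor for new files (default 3).
>         System.out.println("dfs.replication = "
>                 + conf.getInt("dfs.replication", 3));
>
>         // How many failed volumes a datanode tolerates before shutting
>         // itself down (default 0, i.e. one bad disk kills the datanode).
>         System.out.println("dfs.datanode.failed.volumes.tolerated = "
>                 + conf.getInt("dfs.datanode.failed.volumes.tolerated", 0));
>     }
> }
> </snip>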
>
> All snippets are from the region server log on xxx.xxx.xxx.123
>
> <snip>
> 12/06/21 10:26:54 INFO regionserver.HRegion: completed compaction on  
> region fpl,P.2563737_TS3600_D.1328072400,1330649094970.68469595e2cca48506cd6756c7a9e6c3.  
> after 2mins, 42sec
> 12/06/21 10:26:54 INFO regionserver.HRegion: Starting compaction on  
> region fpl,P.69362_TS3600_D.1298955600,1322201140663.fb0ebebfe2257b13d596d74d608a822a.
> 12/06/21 10:26:54 INFO regionserver.Store: Started compaction of 2  
> file(s) in cf=D into  
> hdfs://xxxx:8020/hbase/xxxx/fb0ebebfe2257b13d596d74d608a822a/.tmp,  
> seqid=83988687, totalSize=2.0g
> 12/06/21 10:29:46 INFO regionserver.Store: Completed major compaction of  
> 2 file(s), new  
> file=hdfs://xxxx:8020/hbase/xxxx/fb0ebebfe2257b13d596d74d608a822a/D/1647142169914040514,  
> size=2.0g; total size for store is 2.0g
> </snip>
>
> Then, the disk failed (on xxx.xxx.xxx.129) causing the data node on that  
> server to die:
>
> <snip>
> 12/06/21 12:35:23 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor  
> exception for block blk_-8250761279849076686_833850  
> java.io.IOException: Bad response 1 for block  
> blk_-8250761279849076686_833850 from datanode xxx.xxx.xxx.129:50010
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2664)
> ...
> 12/06/21 12:35:23 INFO hdfs.DFSClient: Error Recovery for block  
> blk_-8250761279849076686_833850 waiting for responder to exit.
> 12/06/21 12:35:23 WARN wal.HLog: IPC Server handler 1 on 60020 took  
> 51641 ms appending an edit to hlog; editcount=1367, len~=34.2k
> 12/06/21 12:35:23 WARN hdfs.DFSClient: Error Recovery for block  
> blk_-8250761279849076686_833850 bad datanode[1] xxx.xxx.xxx.129:50010
> </snip>
>
> Other than xxx.xxx.xxx.129, this data block was also stored on  
> xxx.xxx.xxx.127 and xxx.xxx.xxx.123 (the data node local to this region  
> server). However, the region server was unable to read from these data  
> nodes either. The strange thing is that although *.129 went down, the  
> other data nodes remained up.
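>
> To convince myself the surviving replicas are actually readable, I've  
> been running something like the sketch below from a separate client box  
> (the path is illustrative, and I realize the block in question was still  
> being written to as part of the HLog, so this only proves anything for  
> files that are already closed):
>
> <snip>
> import java.net.URI;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class ReadCheck {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
>         Path p = new Path(args[0]); // pass a real HFile path on the cluster
>
>         byte[] buf = new byte[64 * 1024];
>         long total = 0;
>         FSDataInputStream in = fs.open(p);
>         try {
>             int n;
>             while ((n = in.read(buf)) > 0) {
>                 total += n; // a successful read means some replica answered
>             }
>         } finally {
>             in.close();
>         }
>         System.out.println("read " + total + " bytes from " + p);
>         fs.close();
>     }
> }
> </snip>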
>
> <snip>
> 12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block  
> blk_-8250761279849076686_833850 failed because recovery from primary  
> datanode xxx.xxx.xxx.123:50010 failed 2 times. Pipeline was  
> xxx.xxx.xxx.123:50010, xxx.xxx.xxx.129:50010, xxx.xxx.xxx.xxx:50010.  
> Will retry...
> 12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block  
> blk_-8250761279849076686_833850 bad datanode[1] xxx.xxx.xxx.129:50010
> 12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block  
> blk_-8250761279849076686_833850 in pipeline xxx.xxx.xxx.123:50010,  
> xxx.xxx.xxx.129:50010, xxx.xxx.xxx.127:50010: bad datanode  
> xxx.xxx.xxx.129:50010
> 12/06/21 12:36:23 WARN ipc.HBaseServer: IPC Server handler 16 on 60020  
> caught: java.nio.channels.ClosedChannelException
> at  
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>
> 12/06/21 12:36:23 WARN hdfs.DFSClient: Failed recovery attempt #1 from  
> primary datanode xxx.xxx.xxx.123:50010
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block  
> blk_-8250761279849076686_833850 is already being recovered, ignoring  
> this request to recover it.
> at  
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1765)
> at  
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1950)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at  
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at  
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at  
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1107)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
> at $Proxy10.recoverBlock(Unknown Source)
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2793)
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
> </snip>
>
> This repeats until:
>
> <snip>
> 12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block  
> blk_-8250761279849076686_833850 failed because recovery from primary  
> datanode xxx.xxx.xxx.123:50010 failed 6 times. Pipeline was  
> xxx.xxx.xxx.123:50010, xxx.xxx.xxx.129:50010, 10.128.204.127:50010.  
> Marking primary datanode as bad.
>
> 12/06/21 12:36:23 WARN ipc.HBaseServer: IPC Server Responder, call  
> multi(org.apache.hadoop.hbase.client.MultiAction@1492dba1) from  
> xxx.xxx.xxx.127:51903: output error
>
> 12/06/21 12:36:23 WARN ipc.HBaseServer: IPC Server handler 23 on 60020  
> caught: java.nio.channels.ClosedChannelException
> at  
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>
> 12/06/21 12:37:23 WARN hdfs.DFSClient: Failed recovery attempt #0 from  
> primary datanode xxx.xxx.xxx.127:50010
> java.net.SocketTimeoutException: Call to /xxx.xxx.xxx.127:50020 failed  
> on socket timeout exception: java.net.SocketTimeoutException: 60000  
> millis timeout while waiting for channel to be ready for read. ch :  
> java.nio.channels.SocketChannel[connected local=/xxx.xxx.xxx.123:54996  
> remote=/xxx.xxx.xxx.127:50020]
> at org.apache.hadoop.ipc.Client.wrapException(Client.java:1138)
> at org.apache.hadoop.ipc.Client.call(Client.java:1110)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
> at $Proxy10.recoverBlock(Unknown Source)
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2793)
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while  
> waiting for channel to be ready for read. ch :  
> java.nio.channels.SocketChannel[connected local=/xxx.xxx.xxx.123:54996  
> remote=/xxx.xxx.xxx.127:50020]
> at  
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at  
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> at  
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> at java.io.FilterInputStream.read(FilterInputStream.java:116)
> at  
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:375)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> at java.io.DataInputStream.readInt(DataInputStream.java:370)
> at  
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)
> </snip>
>
> This was repeated until xxx.xxx.xxx.123, xxx.xxx.xxx.127, and  
> xxx.xxx.xxx.129 were all marked as bad.
>
> After this, I saw problems with (what I'm assuming is) HBase trying to  
> talk to other region servers:
>
> <snip>
> 12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server Responder, call  
> multi(org.apache.hadoop.hbase.client.MultiAction@54d0259b) from  
> xxx.xxx.xxx.127:54700: output error
> 12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server handler 8 on 60020  
> caught: java.nio.channels.ClosedChannelException
> at  
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>
> 12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server Responder, call  
> multi(org.apache.hadoop.hbase.client.MultiAction@2d550e9d) from  
> xxx.xxx.xxx.125:36524: output error
> 12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server handler 4 on 60020  
> caught: java.nio.channels.ClosedChannelException
> at  
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1389)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>
> 12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server Responder, call  
> multi(org.apache.hadoop.hbase.client.MultiAction@36c3ec1f) from  
> xxx.xxx.xxx.128:45630: output error
> 12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server handler 17 on 60020  
> caught: java.nio.channels.ClosedChannelException
> at  
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1389)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> at  
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> </snip>
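>
> As far as I can tell, the ClosedChannelException above just means the  
> IPC code tried to write a response on a connection that had already  
> been closed out from under it; there's nothing HBase-specific about the  
> exception itself. A toy reproduction with plain NIO (nothing from the  
> cluster):
>
> <snip>
> import java.net.InetSocketAddress;
> import java.nio.ByteBuffer;
> import java.nio.channels.ClosedChannelException;
> import java.nio.channels.ServerSocketChannel;
> import java.nio.channels.SocketChannel;
>
> public class ClosedChannelDemo {
>     public static void main(String[] args) throws Exception {
>         // Local listener just so the client channel has something to connect to.
>         ServerSocketChannel server = ServerSocketChannel.open();
>         server.socket().bind(new InetSocketAddress("127.0.0.1", 0));
>
>         SocketChannel client =
>                 SocketChannel.open(server.socket().getLocalSocketAddress());
>         client.close(); // the channel is closed before we try to use it
>
>         try {
>             client.write(ByteBuffer.wrap(new byte[] { 1 })); // write after close
>         } catch (ClosedChannelException e) {
>             // Same exception the Responder logs when the connection is gone.
>             System.out.println("caught: " + e);
>         }
>         server.close();
>     }
> }
> </snip>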
>
> The region servers on xxx.xxx.xxx.123 (this region server),  
> xxx.xxx.xxx.128 and xxx.xxx.xxx.129 all went down. However, it is  
> strange that there were I/O errors talking to xxx.xxx.xxx.125 and  
> xxx.xxx.xxx.127 as well (which didn't go down).
>
> Eventually the region server died:
>
> <snip>
> java.io.InterruptedIOException: Aborting compaction of store D in region  
> fpl,P.1596002_TS3600_D.1304226000,1334196560513.4106274c5a8852493fc20d2e50a7e428.  
> because user requested stop.
> at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:961)
> at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:743)
> at  
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:808)
> at  
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:748)
> at  
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>
> 12/06/21 12:39:24 ERROR hdfs.DFSClient: Exception closing file  
> /hbase/.logs/xxxxxxpnb003.dataraker.net,60020,1338412070372/xxxxxxpnb003.dataraker.net%3A60020.1340272250127  
> : java.io.IOException: Error Recovery for block  
> blk_-8250761279849076686_833850 failed because recovery from primary  
> datanode 10.128.204.129:50010 failed 6 times. Pipeline was  
> xxx.xxx.xxx.xxx:50010. Aborting...
> java.io.IOException: Error Recovery for block  
> blk_-8250761279849076686_833850 failed because recovery from primary  
> datanode xxx.xxx.xxx.xxx:50010 failed 6 times. Pipeline was  
> xxx.xxx.xxx.129:50010. Aborting...
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2833)
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
> at  
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
> 12/06/21 12:39:24 INFO regionserver.ShutdownHook: Shutdown hook finished.
> </snip>
>
> The other two nodes that died also have similar logs.
>
> I've uploaded the log to:  
> http://www.linuxlefty.com/hbase-hbase-regionserver-xxxxxpnb003.dataraker.net.out
>
> --
> Sincerely,
>
> ~ Peter
>
> Peter Naudus
> DataRaker
> Email: pnaudus@dataraker.com




-- 

Sincerely,

~ Peter

Peter Naudus
DataRaker
Cell: 917.689.8451
Work: 703.639.4010
Email: pnaudus@dataraker.com

The contents of this email and any attachments are confidential. They are  
intended for the named recipient(s) only.