hbase-user mailing list archives

From Adam Wilhelm <awilh...@mybuys.com>
Subject Region Server Crashing with : IOE in log roller
Date Wed, 26 Nov 2014 21:43:28 GMT
We are running an 80-node cluster:
HDFS version: 0.20.2-cdh3u5
HBase version: 0.90.6-cdh3u5

The issue we are seeing is that region servers occasionally crash. So far it has happened about once
a week, not on the same day or at the same time.

The error we are getting in the RegionServer logs is:

2014-11-26 09:11:04,460 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
region server serverName=hd073.xxxxxxxx,60020,1407311682582, load=(requests=0, regions=227,
usedHeap=9293, maxHeap=12250): IOE in log roller
java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:677)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:624)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:560)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:96)
Caused by: java.io.IOException: java.io.IOException: Call to %NAMENODE%:8020 failed on local
exception: java.io.IOException: Connection reset by peer
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:674)
        ... 3 more
Caused by: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException:
Connection reset by peer
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
        at org.apache.hadoop.ipc.Client.call(Client.java:1155)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy7.create(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy7.create(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3417)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:751)
        at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:200)
        at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:653)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:444)
        at sun.reflect.GeneratedMethodAccessor364.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
        ... 4 more
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
        at sun.nio.ch.IOUtil.read(IOUtil.java:175)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:376)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:858)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:767)
2014-11-26 09:11:04,460 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException:
Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by
peer
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
        at org.apache.hadoop.ipc.Client.call(Client.java:1155)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy7.addBlock(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy7.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3719)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3586)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2400(DFSClient.java:2792)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)

The servers aren't under any major load, but they appear to be having trouble communicating
with the NameNode. There are what appear to be corresponding errors in the DataNode log. Those
look like:

2014-11-26 00:02:15,423 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.100.2.76:50010,
storageID=DS-562360767-10.100.2.76-50010-1358397869707, infoPort=50075, ipcPort=50020):Got
exception while serving blk_-5442848061718769346_625833634 to /10.100.2.76:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready
for write. ch : java.nio.channels.SocketChannel[connected local=/10.100.2.76:50010 remote=/10.100.2.76:55462]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)
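
One other thing I noticed while reading these: the 480000 millis in the DataNode warning looks like
the stock DataNode socket write timeout rather than something we have tuned, and the transfer it gave
up on was to a client on the same host (10.100.2.76 serving a block back to 10.100.2.76), i.e. our own
local reader went quiet. If I have the defaults right, these are the relevant HDFS settings (shown at
their stock defaults, purely for reference):

        dfs.datanode.socket.write.timeout = 480000    (8 minutes; the timeout in the warning above)
        dfs.socket.timeout                = 60000     (read-side socket timeout)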


What I am having trouble proving, so that I can make an educated guess at a fix, is whether this is a
genuine communication problem with the NameNode (i.e., an issue on that server), or whether the
failures and timeouts are caused by local resource problems on the DataNode/RegionServer host itself.
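
A rough way to line the two logs up (an untested sketch; the log file paths and the 60-second window
are placeholders, and it assumes both logs carry the standard log4j timestamp prefix shown above):

# correlate_fatals.py -- untested sketch
# For each "IOE in log roller" FATAL in a RegionServer log, print every line the
# NameNode logged within +/- 60 seconds of it. Lines in both logs start with
# "2014-11-26 09:11:04,460 LEVEL ...", so the first 19 characters parse as a timestamp.
import sys
from datetime import datetime, timedelta

RS_LOG, NN_LOG = sys.argv[1], sys.argv[2]   # regionserver log, namenode log (placeholder arguments)
WINDOW = timedelta(seconds=60)              # placeholder correlation window

def ts(line):
    try:
        return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None                         # continuation lines / stack frames

with open(RS_LOG) as f:
    fatals = [ts(l) for l in f if "FATAL" in l and "IOE in log roller" in l]
fatals = [t for t in fatals if t is not None]

with open(NN_LOG) as f:
    for line in f:
        t = ts(line)
        if t and any(abs(t - ft) <= WINDOW for ft in fatals):
            sys.stdout.write(line)

If nothing interesting shows up on the NameNode side around those times, that would point back at the
local host.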

We are running a RegionServer (RS), DataNode (DN), and TaskTracker (TT) on each of the worker servers.

Any insight or suggestions would be much appreciated.

Thanks,


Adam Wilhelm

