hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Region Server Crashing with : IOE in log roller
Date Wed, 26 Nov 2014 23:57:54 GMT
The region server log snippet is from 09:11:04, while the data node log
snippet is from 00:02.
Do you see a similar warning around 09:11 in the data node log?
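
One way to check this is to filter the data node log for warnings close to the time of the abort. The following is only a rough sketch: it assumes the usual log4j line layout seen in the snippets below ("date time,millis LEVEL class: message"), and the function name `lines_near` and the 5-minute window are made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout assumed from the quoted logs, e.g. "2014-11-26 09:11:04,460"
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def lines_near(log_lines, target, window_minutes=5,
               levels=("WARN", "ERROR", "FATAL")):
    """Return log lines at the given levels within +/- window_minutes of target.

    Lines that do not start with a timestamp (stack trace continuation
    lines, "Caused by:" lines) are skipped.
    """
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        parts = line.split(None, 3)  # date, time, level, rest
        if len(parts) < 3:
            continue
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1], TS_FORMAT)
        except ValueError:
            continue  # not a timestamped line
        if parts[2] in levels and abs(ts - target) <= window:
            hits.append(line.rstrip("\n"))
    return hits
```

For example, with a hypothetical data node log path, `lines_near(open("datanode.log"), datetime(2014, 11, 26, 9, 11, 4))` would list the WARN/ERROR/FATAL entries logged between 09:06 and 09:16. If nothing shows up there, the 00:02 timeout is probably unrelated to the abort.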

BTW, the 0.90 release is 3 major releases behind. Please consider upgrading.

Cheers

On Wed, Nov 26, 2014 at 1:43 PM, Adam Wilhelm <awilhelm@mybuys.com> wrote:

> We are running an 80 node cluster:
> Hdfs version: 0.20.2-cdh3u5
> Hbase version: 0.90.6-cdh3u5
>
> The issue we have is that region servers crash infrequently. So far
> it has been once a week, not on the same day or time.
>
> The error we are getting in RegionServer logs is:
>
> 2014-11-26 09:11:04,460 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=hd073.xxxxxxxx,60020,1407311682582, load=(requests=0,
> regions=227, usedHeap=9293, maxHeap=12250): IOE in log roller
> java.io.IOException: cannot get log writer
>         at
> org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:677)
>         at
> org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:624)
>         at
> org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:560)
>         at
> org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:96)
> Caused by: java.io.IOException: java.io.IOException: Call to
> %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection
> reset by peer
>         at
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
>         at
> org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:674)
>         ... 3 more
> Caused by: java.io.IOException: Call to %NAMENODE%:8020 failed on local
> exception: java.io.IOException: Connection reset by peer
>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1155)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>         at $Proxy7.create(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy7.create(Unknown Source)
>         at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3417)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:751)
>         at
> org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:200)
>         at
> org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:653)
>         at
> org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:444)
>         at sun.reflect.GeneratedMethodAccessor364.invoke(Unknown Source)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
>         ... 4 more
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>         at
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:376)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:858)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:767)
> 2014-11-26 09:11:04,460 WARN org.apache.hadoop.hdfs.DFSClient:
> DataStreamer Exception: java.io.IOException: Call to %NAMENODE%:8020 failed
> on local exception: java.io.IOException: Connection reset by peer
>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1155)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>         at $Proxy7.addBlock(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy7.addBlock(Unknown Source)
>         at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3719)
>         at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3586)
>         at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2400(DFSClient.java:2792)
>         at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)
>
> The servers aren't under any major load, but they appear to be having
> issues communicating with the NameNode. There are what appear to be
> corresponding errors in the DataNode log. Those look like:
>
> 2014-11-26 00:02:15,423 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.100.2.76:50010,
> storageID=DS-562360767-10.100.2.76-50010-1358397869707, infoPort=50075,
> ipcPort=50020):Got exception while serving
> blk_-5442848061718769346_625833634 to /10.100.2.76:
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/10.100.2.76:50010
> remote=/10.100.2.76:55462]
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>         at
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>         at
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)
>
>
> What I am having trouble proving, and then making an educated guess at
> resolving, is whether this is an actual communication issue with the
> NameNode caused by problems on that server, or whether the write issues
> and timeouts are caused by local resource problems on the
> DataNode/RegionServer host.
>
> We are running RS, DN, and TT on each of the worker servers.
>
> Any insight or suggestions would be much appreciated.
>
> Thanks,
>
>
> Adam Wilhelm
>
>
