hbase-user mailing list archives

From "LeBlanc, Jacob" <jacob.lebl...@hpe.com>
Subject responseTooSlow warnings and hdfs ConnectTimeoutExceptions
Date Mon, 15 Aug 2016 15:05:25 GMT
Hi all,

We have a production cluster where we are seeing periodic client RPC timeouts as well as responseTooSlow
warnings, scanner lease expirations, and hdfs read timeouts on some region servers. We've
been trying to diagnose this for a couple of weeks, but still no luck on finding a root cause.
Anybody with similar past experiences or thoughts on other things we can look for would be
very much appreciated.

When the issues first showed up, it seemed to be isolated to a single region server so we
suspected hardware issues. However, after dropping it out of the cluster, two other servers
started showing similar problems. We've run hbck and hdfs fsck and they come up clean.

After suspecting the culprit might be long GC pauses in the region server, we enabled GC
logging, but that didn't show anything too crazy (occasional promotion failures causing a 4-5
second pause, but even those don't seem to line up with errors and warnings in the region
server logs).
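
One way to make the "do the pauses line up with the warnings" check mechanical is to compare timestamps in a script. The following is only a minimal Python sketch under stated assumptions: the GC pauses are assumed to have already been pulled out of the GC log as (start, duration) pairs by some other means, the warning timestamps are read from the leading "YYYY-MM-DD HH:MM:SS,mmm" prefix seen in the region server log, and the function names and the 5 second slack are hypothetical choices.

```python
from datetime import datetime, timedelta

# Timestamp prefix as it appears in the region server log, e.g.
# "2016-08-05 20:53:19,194 WARN  [...]"
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_timestamp(line):
    """Parse the leading 23-character timestamp of a region server log line."""
    return datetime.strptime(line[:23], LOG_TS_FORMAT)


def warnings_near_pauses(warning_times, gc_pauses, slack=timedelta(seconds=5)):
    """Return the warning timestamps that fall inside any GC pause window.

    gc_pauses is a list of (pause_start, pause_duration) pairs. How they are
    extracted from the GC log is deliberately left out (assumption).
    The slack widens each window on both sides to allow for clock skew and
    for warnings that are logged shortly after the pause ends.
    """
    hits = []
    for w in warning_times:
        for start, duration in gc_pauses:
            if start - slack <= w <= start + duration + slack:
                hits.append(w)
                break  # one matching pause is enough for this warning
    return hits
```

If the returned list stays empty or small compared to the total number of warnings, that would support the impression that GC pauses are not the main cause.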

Could this simply be a matter of too much load causing I/O to block for long periods of time?
We've been trying to correlate the problems in the region server logs with anything in our
environment that might cause huge spikes on read or write load but so far no smoking gun.
We've also tried playing with the OS's disk write buffer settings (like vm.dirty_background_ratio
and vm.dirty_ratio) but no luck. Our cluster is certainly under moderate read and write loads,
but nothing that I would have thought would cause problems like the 60 second HDFS read timeouts.
Here is one example of those timeouts from the log:

2016-08-05 20:53:19,194 WARN  [regionserver60020.replicationSource,a2] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel
to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/]
                at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
                at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
                at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
                at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
                at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
                at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:618)
                at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
                at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
                at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:697)
                at java.io.DataInputStream.readInt(Unknown Source)
                at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.setTrailerIfPresent(ProtobufLogReader.java:186)
                at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initInternal(ProtobufLogReader.java:155)
                at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initReader(ProtobufLogReader.java:106)
                at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:69)
                at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:126)
                at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
                at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:77)
                at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:68)
                at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:503)
                at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:309)

Those timeouts seem to occur virtually anywhere, not just while replicating the WAL to our
other cluster. And they aren't limited to a single region or even a single table.
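
To put numbers on "virtually anywhere", the timeouts can be counted per logging thread. This is a hedged Python sketch, not a tool we used: it assumes every log entry starts with a header line shaped like the one in the example above (timestamp, level, thread name in square brackets) and attributes each line mentioning ConnectTimeoutException to the most recent header. The function name is hypothetical.

```python
import re
from collections import Counter

# Header line shape taken from the example above:
# "2016-08-05 20:53:19,194 WARN  [regionserver60020.replicationSource,a2] ..."
HEADER = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \w+\s+\[([^\]]+)\]"
)


def timeouts_by_thread(lines, needle="ConnectTimeoutException"):
    """Count lines containing `needle`, grouped by the thread of the
    enclosing log entry (the last header line seen before the match)."""
    counts = Counter()
    current_thread = None
    for line in lines:
        m = HEADER.match(line)
        if m:
            current_thread = m.group(1)
        if needle in line and current_thread is not None:
            counts[current_thread] += 1
    return counts
```

Note that an entry whose stack trace repeats the exception name (for example in a "Caused by" line) would be counted more than once by this sketch.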

So any thoughts on where we could look next? Anybody seen this before and attributed it to
anything other than spiky loads? Any good way to identify abnormal load spikes?
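
As a starting point for narrowing down time windows, the warnings themselves can be bucketed per minute. The Python sketch below does not measure load; it only shows when WARN lines cluster, so those minutes can then be compared against whatever load metrics are available. It assumes the timestamp and level prefix shown in the log example above. The function names and the "3 times the median" threshold are hypothetical.

```python
import re
from collections import Counter

# Captures "YYYY-MM-DD HH:MM" (the minute) and the log level.
TS_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} (\w+)")


def warn_counts_per_minute(lines, level="WARN"):
    """Count log lines of the given level per minute."""
    counts = Counter()
    for line in lines:
        m = TS_PREFIX.match(line)
        if m and m.group(2) == level:
            counts[m.group(1)] += 1
    return counts


def busiest_minutes(counts, factor=3.0):
    """Return the minutes whose count exceeds `factor` times the median
    per-minute count. Minutes with no warnings at all are not part of
    `counts`, so the median is taken over non-empty minutes only."""
    if not counts:
        return []
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]  # upper median for even lengths
    return sorted(minute for minute, n in counts.items() if n > factor * median)
```

Run over a region server log opened with `open(path)`, `busiest_minutes(warn_counts_per_minute(f))` gives a short list of minutes worth checking against disk, network, and request metrics.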


