accumulo-dev mailing list archives

From Christopher <ctubb...@apache.org>
Subject Lots of "Connection reset by peer"
Date Tue, 30 Aug 2016 22:12:27 GMT
Thrift is not happy on some replication ITs I've run lately. One test timed
out after 40 minutes without ever finishing. The symptoms are many
client-side messages about failing to open a transport, plus the server-side
messages below. Both occurred a *lot*, which suggests the client is retrying
indefinitely:

2016-08-30 19:48:13,476 [rpc.CustomNonBlockingServer$CustomFrameBuffer] WARN : Got an IOException in internalRead!
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
        at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:142)
        at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:539)
        at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:338)
        at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:203)
        at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:203)
        at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)

I saw a comment on a mailing list somewhere suggesting this can happen when
the client handles a custom Thrift exception without properly closing the
connection. It's possible we're doing something wrong before we retry. I
think more investigation is needed before I file a JIRA; right now I'm not
even sure which component is at fault, so I don't know what to file it
against.

In the meantime, has anybody seen this, or does anybody have any insight?
This is all on a single node, running ITs. Since everything is on localhost,
there shouldn't be any "network" problems external to the test and Accumulo
itself that would cause a TCP reset.
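
For reference, a reset on loopback does not need a network fault; an abortive
close by the peer is enough. The following is a minimal sketch using plain
java.net sockets, not Thrift and not Accumulo code. The class and method names
are made up for illustration, and the SO_LINGER trick is only one way to force
a reset, not necessarily what happens in the ITs.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class LocalhostResetDemo {

    /**
     * Opens a loopback connection, aborts it from the client side, and then
     * reads on the server side. Returns the message of the IOException the
     * server-side read throws, or null if the read ended without one.
     */
    static String readAfterClientAbort() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Socket client = new Socket(loopback, server.getLocalPort());
            try (Socket accepted = server.accept()) {
                // SO_LINGER with a zero timeout makes close() abort the
                // connection (TCP RST) instead of doing the normal FIN
                // shutdown.
                client.setSoLinger(true, 0);
                client.close();
                try {
                    accepted.getInputStream().read();
                    return null; // clean end of stream, no reset observed
                } catch (IOException e) {
                    return e.getMessage();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("server-side read failed with: " + readAfterClientAbort());
    }
}
```

If the server-side read fails here, the reset came purely from how the client
closed the socket, which is consistent with the idea that something on the
client side is tearing connections down uncleanly before retrying.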
