accumulo-dev mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Lots of "Connection reset by peer"
Date Tue, 30 Aug 2016 22:24:40 GMT
I wonder if this is happening with replication tests because something in
the replication code specifically is failing to close connections and/or
return them to the pool. It would probably be somewhere in the drain()
code, because that's where I saw these tests stuck most often.
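
A minimal sketch of the failure mode being discussed, under stated assumptions: Transport, Pool, giveBack, and callWithRetry are hypothetical stand-ins, not Accumulo's or Thrift's actual API, and this is not the drain() code. It only shows the general pattern of returning a transport in a finally block, so a retry loop cannot accumulate checked-out sockets when a call throws.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

public class TransportPoolSketch {
    /** Hypothetical stand-in for a pooled Thrift transport (one socket / fd). */
    static final class Transport implements AutoCloseable {
        private boolean open = true;
        boolean isOpen() { return open; }
        @Override public void close() { open = false; }
    }

    /** Hypothetical pool that counts how many transports are checked out. */
    static final class Pool {
        private final Deque<Transport> idle = new ArrayDeque<>();
        private int checkedOut = 0;

        synchronized Transport borrow() {
            Transport t = idle.poll();
            if (t == null || !t.isOpen()) {
                t = new Transport(); // would open a new socket in real code
            }
            checkedOut++;
            return t;
        }

        /** Healthy transports are reused; failed ones are closed, not leaked. */
        synchronized void giveBack(Transport t, boolean healthy) {
            checkedOut--;
            if (healthy) {
                idle.push(t);
            } else {
                t.close();
            }
        }

        synchronized int checkedOut() { return checkedOut; }
        synchronized int idleCount() { return idle.size(); }
    }

    /** Hypothetical unit of RPC work. */
    interface Call { void run(Transport t) throws Exception; }

    /** Retries a call; every attempt returns its transport, even on failure. */
    static boolean callWithRetry(Pool pool, Call call, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Transport t = pool.borrow();
            boolean ok = false;
            try {
                call.run(t);
                ok = true;
                return true;
            } catch (Exception e) {
                // Swallowed here only to keep the sketch short; real code would log it.
            } finally {
                pool.giveBack(t, ok); // without this, each retry leaks one fd
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        boolean result = callWithRetry(pool, t -> {
            throw new IOException("simulated failure");
        }, 5);
        System.out.println("succeeded=" + result + " checkedOut=" + pool.checkedOut());
    }
}
```

If the finally block were dropped, each failed attempt would leave one transport checked out, which would be consistent with the indefinite-retry symptom described in the quoted messages.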

On Tue, Aug 30, 2016 at 6:18 PM Marc P. <marc.parisi@gmail.com> wrote:

> Yah, I saw this a lot when I wasn't closing Thrift connections... but I also
> saw it when the client would close prematurely and not return the transport
> to the Thrift transport pool.
>
> In one case I hadn't finished with the work in a thread but kept opening
> Thrift connections, since it would be 'time sliced' for I/O. In that case I
> opened too many sockets (fds)... maybe hitting max open files because a
> transport isn't being returned in the middle of a work unit?
>
> On Tue, Aug 30, 2016, 6:12 PM Christopher <ctubbsii@apache.org> wrote:
>
> > Thrift is not happy on some replication ITs I've run lately. I had one test
> > timeout after 40 minutes... and it never finished. The symptom is lots of
> > client side messages about failure to open transport, and the server side
> > messages were (and both were occurring a *lot*, indicating indefinite
> > retries):
> >
> > 2016-08-30 19:48:13,476 [rpc.CustomNonBlockingServer$CustomFrameBuffer] WARN : Got an IOException in internalRead!
> > java.io.IOException: Connection reset by peer
> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
> >         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:142)
> >         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:539)
> >         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:338)
> >         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:203)
> >         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:203)
> >         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
> >
> > I saw one comment on a mailing list somewhere that indicated this might be
> > caused by a client side handling of a custom Thrift Exception, not properly
> > closing the connection. It's possible we're doing something badly before we
> > retry. I think more investigation is needed before I file a JIRA (not even
> > sure what to file it against, right now... because I'm not sure what
> > component is even at fault).
> >
> > In the meantime, has anybody seen this? Does anybody have any insight into
> > this? This is all on a single node, running ITs. There really shouldn't be
> > any "network" problems which would cause a TCP reset from external to the
> > test and Accumulo itself, since it's all localhost.
> >
>
