thrift-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yuxuan Wang (Jira)" <>
Subject [jira] [Created] (THRIFT-5233) I/O timeout handling in go library
Date Fri, 12 Jun 2020 21:50:00 GMT
Yuxuan Wang created THRIFT-5233:

             Summary: I/O timeout handling in go library
                 Key: THRIFT-5233
             Project: Thrift
          Issue Type: Improvement
          Components: Go - Compiler, Go - Library
    Affects Versions: 0.13.0
            Reporter: Yuxuan Wang

While debugging the bug in the first implementation of THRIFT-5214, I started to look into
the rabbit hole of I/O timeouts in the go library (mainly the socket timeout on TSocket),
and start to realize that the current way we handle it is not great.

>From client's perspective, how a request goes is:

client prepare the request TStruct -> send the request over the wire -> start reading
the response over the wire -> handle the response or I/O error

The problem here, is that server will only send the first byte of response out after it finished
processing the request. So if the client incorrectly configured a socket timeout on the TSocket
used by this request to a value that's too low, the reading will just report i/o timeout when
the deadline reaches, and the client will just fail the whole request.

This will cause problems:

# Client can't set socket timeout to something too low. In fact, if they use a client pool
(so for most requests there's no overhead on connecting), they need the socket timeout to
be at least the latency SLA of the server they talk to, otherwise there's a serious risk that
the client will fail a lot of requests prematurely.
# On the other hand, setting the socket timeout to something too high is also a bad practice.
In case that server is overloaded and cannot handle requests in a timely fashion, client should
have a way to fail faster, instead of waiting for a very long timeout to expire.

If the client set a deadline on the context object with the call, their expectation usually
would be that the request will fail after the deadline passed (a small overhead is usually
acceptable). But the socket timeout is usually some fixed value configured to the program,
while the actual deadline on the context is more variable (e.g. this could be a server talking
to upstream server, it has a fixed totally deadline for the whole endpoint, but the steps
before that might take variable time so the deadline left here can vary a lot).

This leads to the point that we would need a way to keep socket timeout and the deadline set
in the context object in-sync.

There are a few different options. The first obvious one, is to pass the context object all
the way to TSocket.Read, so that function can extract the deadline out of the context object
(if any), and set that as the socket timeout instead of the pre-configured, fixed one.

But that is also easy to be ruled out, because that would change the function signature of
TSocket.Read, making TSocket no longer implement io.Reader interface.

The next option, I think, is to handle it on TProtocol level. The idea is that we pass the
context object all the way into TProtocol.ReadMessageBegin (and maybe other read functions
of TProcotol?). This way, it can handle the read errors, check if it's a timeout error, and
if it is and the context deadline haven't passed, it can just keep retrying again. This way,
we can set the fixed socket timeout to a low number when we know we will always have a real
deadline in the context attached, and just let TProtocol implementation to keep retrying instead.
If the user don't attach a deadline to the context object, then they still need to set a large
number on the socket timeout.

A slightly different option, is to add SetReadTimeout function to TTransport interface. So
all the TTransport implementations wrapping TSocket should just pass that along. If the terminal
one is actually not a TSocket (for example, it's TMemoryBuffer), then this function just does
nothing. This way, the TProtocol.Read* functions can just extract deadline out of the context
object, and call their TTransport.SetReadTimeout with that deadline.

Either way, this is going to be a breaking change that we need to add context object to TProtocol.Read*
function signatures. As a result, I believe some compiler change might also be required to
pass it all the way to TProtocol calls. But even if compiler changes are needed, that part
should be minimal (just make sure we are passing context object correctly), and most of the
changes would be on TProtocol implementations.

This message was sent by Atlassian Jira

View raw message