impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Henry Robinson (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Date Tue, 28 Jun 2016 22:05:08 GMT
Henry Robinson has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout
......................................................................


Patch Set 15:

(11 comments)

How are you going to test all these failure paths for all the RPCs we have in the system?
How do we convince ourselves that these failures correctly lead to query tear down in all
cases?

http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 97:   void DestroyClient(ClientKey* client_key);
needs a comment


PS15, Line 197: hasRpcError_
C++-style naming


PS15, Line 239: IsTimeoutTException
should now be called IsRecvTimeoutTException()


PS15, Line 240: RPC_TIMEOUT
should be RPC_RECV_TIMEOUT


PS15, Line 246: if (strstr(e.what(),"unknown result") != NULL)
this seems very brittle, and at least should be wrapped in an IsXXXException() kind of method.


PS15, Line 268: release
back pressure is added by a blocking RPC, not released.


PS15, Line 273: timeout
what's the unit?


PS15, Line 272: Status DoRpcTimedWait(const F& f, const Request& request, Response*
response,
              :       const G& wait_func, uint64_t timeout, RuntimeState* state, bool*
can_retry = NULL) 
This seems like it breaks abstractions: not all RPCs happen in the context of a query. I also
don't know what wait_func is for, based on the comments.

I think this might be better written as two methods, which would be called like this:

  Status status = client->DoRpc(...., &can_retry);
  while (status.code() == TErrorCode::RPC_TIMEOUT && can_retry) {
    check_if_cancelled_or_timeout();
    status = client->RetryRpcRecv(recv_func);
  }

I understand the idea behind pushing all this logic into this method, but I think it actually
hides too much from the caller. This allows the caller to take appropriate action on network
timeouts without handing in a runtime state to the client (e.g. the caller may want to do
some more work before blocking again).

The other thing you could do is pass in a callback, so we'd have:

 Status status = client->DoRpc(...., [state]() { return state->is_cancelled(); }, ...);

but my fear is that would get a bit messy.


PS15, Line 282: bool no_timeout = timeout == 0;
this can be simplified by having a deadline variable:

  uint64_t deadline = MonotonicMillis() + timeout;


PS15, Line 309: bool hasRpcError_;
the role of this variable is not clear. Please add a comment.


http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS15, Line 1490: // Try to send the RPC 3 times before failing.
Why try 3 times? Have you seen in your testing where there's failure on the first try, but
success on attempts 2 or 3?


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <jyu@cloudera.com>
Gerrit-Reviewer: Alan Choi <alan@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Juan Yu <jyu@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message