impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Juan Yu (Code Review)" <>
Subject [Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Date Mon, 13 Jun 2016 16:06:03 GMT
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout

Patch Set 4:

File be/src/runtime/

Line 125: void ClientCacheHelper::ReleaseClient(ClientKey* client_key) {
> I think it would be better to reset the timeouts to the default values when
Good idea.
I am still running tests to see if we do need different timeout for different RPC calls.
File be/src/runtime/client-cache.h:

PS4, Line 101: SetRpcTimeout
> Would this be clearer as SetClientRpcTimeout() here and below? On first loo

PS4, Line 239: SetRpcTimeout(send_timeout, recv_timeout);
> I'm a little worried about calling this every time we call a DoRpc() since 
You're right, set the timeout per client makes sense.
File be/src/runtime/

PS4, Line 556: is_cancelled_
> Can this also be IsCompleted()? Is it a possibility that there is a sequenc
StopReportThread is called if fragment is completed, or cancel happened during GetNext().

If an internal error happened at other time, it usually trigger internal cancelling, Cancel()
will be called, 
But I agree, IsCompleted() seems a safer check here.

Line 559:     UpdateStatus(Status(TErrorCode::CANCELLED, "Fragment already cancelled."));
> Should this be an error? Or is it better to just return since the user will
I don't think this should be treat as error. the only issue is the report thread keep running
and send status to coordinator which is confusing. 
I'll just call StopReportThread().
File be/src/service/

PS4, Line 128: cause problem
> If we know exactly what sort of problems it could cause, it would be worth 
UpdateFragmentExecStatus() will decrease num_remaining_fragment_instances_ counter, call it
multiple times for a completed fragment will cause DCHECK.
I'll add those to comments.
File tests/custom_cluster/

PS4, Line 39: assert "RPC" in str(ex)
> 'ex' should contain "RPC timed out" right? This would pass for "RPC Error" 
We used to capture only one specific timeout case, and try reopen client for all other errors.
I am not sure that's correct way so I modified the exception handling. But now I see more
RPC error case than before. Maybe I should roll back the exception handling change.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <>
Gerrit-Reviewer: Alan Choi <>
Gerrit-Reviewer: Henry Robinson <>
Gerrit-Reviewer: Juan Yu <>
Gerrit-Reviewer: Sailesh Mukil <>
Gerrit-HasComments: Yes

View raw message