hive-issues mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
Date Fri, 20 Jan 2017 04:33:27 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15831178#comment-15831178 ]

Xuefu Zhang edited comment on HIVE-15671 at 1/20/17 4:33 AM:
-------------------------------------------------------------

Actually, my understanding is a little different. Checking the code, I see:
1. On the server side (the RpcServer constructor), saslHandler's timeout is set using {{getServerConnectTimeoutMs()}}.
2. On the client side, in {{Rpc.createClient()}}, saslHandler's timeout is also set using {{getServerConnectTimeoutMs()}}.
These two are consistent, so I don't see any issue there.

On the other hand,
3. On the server side, in {{RpcServer.registerClient()}}, ClientInfo stores {{getServerConnectTimeoutMs()}}, and when that timeout fires, the exception is TimeoutException("Timed out waiting for client connection.").
4. On the client side, in {{Rpc.createClient()}}, the channel is initialized with {{getConnectTimeoutMs()}}.

To me, it seems there is a mismatch between 3 and 4. In 3, the timeout message implies a "connection timeout", yet the value used is the one intended for the saslHandler handshake. This is why I think 3 should use {{getConnectTimeoutMs()}} instead.
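
For clarity, here is roughly what I have in mind (a sketch only; the attached patch may differ in details):
{code}
  // Sketch: have registerClient() wait using the client connection timeout
  // (hive.spark.client.connect.timeout), matching the channel setup in
  // Rpc.createClient(), instead of the server/client handshake timeout.
  public Future<Rpc> registerClient(final String clientId, String secret,
      RpcDispatcher serverDispatcher) {
    return registerClient(clientId, secret, serverDispatcher, config.getConnectTimeoutMs());
  }
{code}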

Could you take another look?

I actually ran into issues with this. Our cluster is constantly busy, and it takes minutes for Hive to get a YARN container to launch the remote driver. In that case, the query fails because the Spark session cannot be created. For such a scenario, I assumed we should increase *client.connect.timeout*; however, that isn't effective. On the other hand, if I increase *server.connect.timeout*, Hive waits longer for the driver to come up, which is good. However, doing that has the bad consequence that Hive will wait just as long to declare a failure if the remote driver dies for any reason.

With the patch in place, the problem is solved in both cases. I only need to increase *client.connect.timeout*
and keep *server.connect.timeout* unchanged.
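
To illustrate (a rough sketch, not taken from the patch; the timeout value below is made up), only the client-side knob would then need to change:
{code}
// Hypothetical example: bump only hive.spark.client.connect.timeout and
// leave hive.spark.client.server.connect.timeout at its default.
import org.apache.hadoop.hive.conf.HiveConf;

public class SparkClientTimeoutExample {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Allow several minutes for YARN to allocate a container and for the
    // remote driver to connect back (value is illustrative only).
    conf.set("hive.spark.client.connect.timeout", "600000ms");
    // hive.spark.client.server.connect.timeout stays unchanged, so a dead
    // driver is still detected quickly once the connection exists.
    System.out.println(conf.get("hive.spark.client.connect.timeout"));
  }
}
{code}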


> RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-15671
>                 URL: https://issues.apache.org/jira/browse/HIVE-15671
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15671.patch
>
>
> {code}
>   /**
>    * Tells the RPC server to expect a connection from a new client.
>    * ...
>    */
>   public Future<Rpc> registerClient(final String clientId, String secret,
>       RpcDispatcher serverDispatcher) {
>     return registerClient(clientId, secret, serverDispatcher, config.getServerConnectTimeoutMs());
>   }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of *hive.spark.client.server.connect.timeout*, which is meant as the timeout for the handshake between the Hive client and the remote Spark driver. Instead, the timeout should be *hive.spark.client.connect.timeout*, which is the timeout for the remote Spark driver to connect back to the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
