Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Fri, 20 Jan 2017 04:33:27 +0000 (UTC)
From: "Xuefu Zhang (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13036412.1484880595000.59643.1484886807252@Atlassian.JIRA>
In-Reply-To: <JIRA.13036412.1484880595000@Atlassian.JIRA>
References: <JIRA.13036412.1484880595000@Atlassian.JIRA> <JIRA.13036412.1484880595438@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient()
 erroneously uses server/client handshake timeout for connection timeout
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 20 Jan 2017 04:33:34 -0000


    [ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15831178#comment-15831178 ] 

Xuefu Zhang edited comment on HIVE-15671 at 1/20/17 4:33 AM:
-------------------------------------------------------------

Actually my understanding is a little different. Checking the code, I see:
1. On the server side (RpcServer constructor), saslHandler is given a timeout from {{getServerConnectTimeoutMs()}}.
2. On the client side, in {{Rpc.createClient()}}, saslHandler is also given a timeout from {{getServerConnectTimeoutMs()}}.
These two are consistent, so I don't see any issue there.

On the other hand, 
3. On the server side, in {{RpcServer.registerClient()}}, ClientInfo stores {{getServerConnectTimeoutMs()}}. When the timeout happens, the exception is TimeoutException("Timed out waiting for client connection.").
4. On client side, in {{Rpc.createClient()}}, the channel is initialized with {{getConnectTimeoutMs()}}.

To me, there seems to be a mismatch between 3 and 4. In 3, the timeout message implies a connection timeout, while the value is the one intended for the saslHandler handshake. This is why I think 3 should use {{getConnectTimeoutMs()}} instead.
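
The mismatch between 3 and 4 can be sketched in isolation. The following is a minimal model and not the actual Hive code: {{RegisterTimeoutSketch}}, {{TimeoutConfig}}, {{outcome()}} and the timeout values are hypothetical stand-ins that only mirror the names discussed above. It shows the server-side wait for a client being bounded by whichever timeout is passed in, so switching the source of that value is the whole change.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model, NOT the Hive classes; names only mirror those in the comment.
class RegisterTimeoutSketch {

  // Stand-in for the RPC configuration. Values are supplied by the caller.
  static final class TimeoutConfig {
    final long connectTimeoutMs;       // models getConnectTimeoutMs()
    final long serverConnectTimeoutMs; // models getServerConnectTimeoutMs()

    TimeoutConfig(long connectTimeoutMs, long serverConnectTimeoutMs) {
      this.connectTimeoutMs = connectTimeoutMs;
      this.serverConnectTimeoutMs = serverConnectTimeoutMs;
    }
  }

  // Item 3 as described: the wait for the client uses the handshake timeout.
  static long currentRegisterTimeoutMs(TimeoutConfig c) {
    return c.serverConnectTimeoutMs;
  }

  // Proposed: use the connect timeout, matching the client side in item 4.
  static long proposedRegisterTimeoutMs(TimeoutConfig c) {
    return c.connectTimeoutMs;
  }

  // Registers a pending client; the future fails if nothing completes it
  // within timeoutMs. Requires Java 9+ for delayedExecutor.
  static CompletableFuture<String> registerClient(String clientId, long timeoutMs) {
    CompletableFuture<String> pending = new CompletableFuture<>();
    CompletableFuture.delayedExecutor(timeoutMs, TimeUnit.MILLISECONDS).execute(
        () -> pending.completeExceptionally(
            new TimeoutException("Timed out waiting for client connection.")));
    return pending;
  }

  // Blocks until the future settles and returns its value or failure message.
  static String outcome(CompletableFuture<String> pending) {
    try {
      return pending.join();
    } catch (CompletionException e) {
      return e.getCause().getMessage();
    }
  }

  public static void main(String[] args) {
    TimeoutConfig config = new TimeoutConfig(50L, 5_000L); // illustrative values
    CompletableFuture<String> late =
        registerClient("client-1", proposedRegisterTimeoutMs(config));
    System.out.println(outcome(late)); // no client connects, so the wait times out

    CompletableFuture<String> onTime =
        registerClient("client-2", currentRegisterTimeoutMs(config));
    onTime.complete("connected"); // client arrives before the timer fires
    System.out.println(outcome(onTime));
  }
}
```

In this model the exception message stays the same in both cases; only the value that bounds the wait changes, which is the point of using {{getConnectTimeoutMs()}} in 3.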

Could you take another look?

I actually ran into issues with this. Our cluster is constantly busy, and it takes minutes for Hive to get a YARN container to launch the remote driver. In that case, the query fails because the Spark session cannot be created. For such a scenario, I would expect that we should increase *client.connect.timeout*. However, that has no effect. On the other hand, if I increase *server.connect.timeout*, Hive waits longer for the driver to come up, which is good. However, doing that has the bad side effect that Hive waits just as long to declare a failure if the remote driver dies for any reason.

With the patch in place, the problem is solved in both cases. I only need to increase *client.connect.timeout* and keep *server.connect.timeout* unchanged.
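
To make the scenario concrete, here is a toy calculation and not Hive code. {{TimeoutTuningSketch}}, the three-minute launch delay and the timeout values are assumptions chosen for illustration; only the two property names come from the issue description. It checks whether a slow-starting driver connects back in time, depending on which property bounds the wait.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a toy model of the scenario above, not Hive code.
class TimeoutTuningSketch {
  static final String CONNECT = "hive.spark.client.connect.timeout";
  static final String HANDSHAKE = "hive.spark.client.server.connect.timeout";

  // Which property bounds the wait for the remote driver to connect back.
  static String waitProperty(boolean patched) {
    return patched ? CONNECT : HANDSHAKE;
  }

  // True if a driver that needs launchMs to start connects before the wait expires.
  static boolean sessionCreated(Map<String, Long> conf, boolean patched, long launchMs) {
    return launchMs <= conf.get(waitProperty(patched));
  }

  public static void main(String[] args) {
    long launchMs = 180_000L;      // assumed: a busy cluster needs 3 minutes
    Map<String, Long> conf = new LinkedHashMap<>();
    conf.put(CONNECT, 300_000L);   // raised to cover the slow launch
    conf.put(HANDSHAKE, 90_000L);  // left unchanged
    System.out.println("unpatched: " + sessionCreated(conf, false, launchMs)); // false
    System.out.println("patched:   " + sessionCreated(conf, true, launchMs));  // true
  }
}
```

With these assumed numbers, raising only *client.connect.timeout* helps once the patch is applied, while the handshake timeout can stay short.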



> RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-15671
>                 URL: https://issues.apache.org/jira/browse/HIVE-15671
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15671.patch
>
>
> {code}
>   /**
>    * Tells the RPC server to expect a connection from a new client.
>    * ...
>    */
>   public Future<Rpc> registerClient(final String clientId, String secret,
>       RpcDispatcher serverDispatcher) {
>     return registerClient(clientId, secret, serverDispatcher, config.getServerConnectTimeoutMs());
>   }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of *hive.spark.client.server.connect.timeout*, which is the timeout for the handshake between the Hive client and the remote Spark driver. Instead, the timeout should be *hive.spark.client.connect.timeout*, which is the timeout for the remote Spark driver to connect back to the Hive client.

