hive-issues mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
Date Fri, 20 Jan 2017 23:13:26 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832582#comment-15832582 ]

Xuefu Zhang edited comment on HIVE-15671 at 1/20/17 11:12 PM:
--------------------------------------------------------------

Patch #1 follows what [~vanzin] suggested. With it, I observed the following behavior:

1. Increasing *server.connect.timeout* makes Hive wait longer for the driver to connect
back, which solves the busy-cluster problem.
2. Killing the driver while the job is running immediately fails the query on the Hive side
with the following error:
{code}
2017-01-20 22:01:08,235	Stage-2_0: 7(+3)/685	Stage-3_0: 0/1	
2017-01-20 22:01:09,237	Stage-2_0: 16(+6)/685	Stage-3_0: 0/1	
Failed to monitor Job[ 1] with exception 'java.lang.IllegalStateException(RPC channel is closed.)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
{code}

This matches my expectations.

However, I didn't test the case where the driver dies before connecting back to Hive. (It's
also hard to construct such a test case.) In that case, I assume Hive will wait for
*server.connect.timeout* before declaring a failure. I don't think there is much we can do
for that case, and the change here shouldn't have any implications for it.
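For reference, the wait in item 1 is governed by the full property name given in the issue
description; it could be raised per session along these lines (the value below is only an
example, not a recommended default):
{code}
SET hive.spark.client.server.connect.timeout=300000;
{code}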



> RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-15671
>                 URL: https://issues.apache.org/jira/browse/HIVE-15671
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15671.1.patch, HIVE-15671.patch
>
>
> {code}
>   /**
>    * Tells the RPC server to expect a connection from a new client.
>    * ...
>    */
>   public Future<Rpc> registerClient(final String clientId, String secret,
>       RpcDispatcher serverDispatcher) {
>     return registerClient(clientId, secret, serverDispatcher, config.getServerConnectTimeoutMs());
>   }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of *hive.spark.client.server.connect.timeout*,
> which is meant to be the timeout for the handshake between the Hive client and the remote
> Spark driver. Instead, the timeout should be *hive.spark.client.connect.timeout*, which is
> the timeout for the remote Spark driver connecting back to the Hive client.
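To make the mix-up concrete, here is a minimal, self-contained sketch of the two timeouts
and which one each version of the overload delegates (the class name, config map, and
timeout values are all illustrative, not Hive's actual API or defaults):
{code}
import java.util.HashMap;
import java.util.Map;

public class RegisterClientTimeoutSketch {
    // Example values only; these are not Hive's actual defaults.
    private static final Map<String, Long> CONFIG = new HashMap<>();
    static {
        // Timeout for the remote Spark driver connecting back to the Hive client.
        CONFIG.put("hive.spark.client.connect.timeout", 1000L);
        // Timeout for the handshake between the Hive client and the remote driver.
        CONFIG.put("hive.spark.client.server.connect.timeout", 90000L);
    }

    // What the quoted overload effectively does today: it delegates with the
    // server/client handshake timeout.
    static long buggyRegisterClientTimeout() {
        return CONFIG.get("hive.spark.client.server.connect.timeout");
    }

    // What the fix described in the issue does: delegate with the timeout for
    // the remote driver connecting back to the Hive client.
    static long fixedRegisterClientTimeout() {
        return CONFIG.get("hive.spark.client.connect.timeout");
    }

    public static void main(String[] args) {
        System.out.println("buggy overload waits (ms): " + buggyRegisterClientTimeout());
        System.out.println("fixed overload waits (ms): " + fixedRegisterClientTimeout());
    }
}
{code}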



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
