hive-issues mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-16984) HoS: avoid waiting for RemoteSparkJobStatus::getAppID() when remote driver died
Date Thu, 29 Jun 2017 23:17:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069173#comment-16069173 ]

Xuefu Zhang commented on HIVE-16984:
------------------------------------

The patch looks good. However, I think we had better keep the previous logic (before
HIVE-15171) intact. To do that, we can change the current API, SparkJobStatus.getAppId(), into
a pure getter and add SparkJobStatus.fetchAppId(), which makes the remote RPC call and caches
the app ID in the SparkJobStatus class. That way, getAppId() can be called at any time without
a network round trip, and the existing call to getAppId() is changed to fetchAppId().
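A minimal sketch of the getter/fetcher split proposed above. The class and method names mirror SparkJobStatus, but the RPC is simulated with a placeholder; this is not Hive's actual spark-client code.

```java
// Hypothetical sketch of splitting getAppId() into a pure getter plus a
// fetchAppId() that performs the remote call and caches the result.
public class AppIdSketch {
    private volatile String appId; // cached after the first successful fetch

    // Pure getter: never triggers a network round trip.
    public String getAppId() {
        return appId;
    }

    // Performs the (simulated) remote RPC once and caches the app ID.
    public String fetchAppId() {
        if (appId == null) {
            appId = remoteCallForAppId();
        }
        return appId;
    }

    // Stand-in for RemoteSparkJobStatus's RPC to the remote driver.
    private String remoteCallForAppId() {
        return "application_1498777020000_0001";
    }

    public static void main(String[] args) {
        AppIdSketch s = new AppIdSketch();
        System.out.println(s.getAppId());   // null: nothing cached yet
        System.out.println(s.fetchAppId()); // triggers the RPC and caches
        System.out.println(s.getAppId());   // returns the cached ID, no RPC
    }
}
```

Callers that merely report the app ID (logs, UI) would use getAppId(); only the one existing call site that must contact the remote driver would switch to fetchAppId().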

> HoS: avoid waiting for RemoteSparkJobStatus::getAppID() when remote driver died
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-16984
>                 URL: https://issues.apache.org/jira/browse/HIVE-16984
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-16984.1.patch
>
>
> In HoS, after a RemoteDriver is launched, it may fail to initialize a Spark context and
thus the ApplicationMaster will die eventually. In this case, there are two issues related
to RemoteSparkJobStatus::getAppID():
> 1. Currently we call {{getAppID()}} before starting the monitoring job. The former
waits for {{hive.spark.client.future.timeout}}, and the latter waits for
{{hive.spark.job.monitor.timeout}}. The error message for the latter presents {{hive.spark.job.monitor.timeout}}
as the total time spent waiting for job submission. However, this is inaccurate, as it doesn't
include {{hive.spark.client.future.timeout}}.
> 2. If the RemoteDriver dies suddenly, we currently may still wait out the full
timeouts. This could be avoided if we detect that the channel between the client and the
remote driver has closed.
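The fail-fast idea in point 2 can be sketched as follows. This is an illustrative example, not Hive's RPC code: the channel-closed flag and method names are hypothetical stand-ins for the client noticing that the remote driver's connection has gone away.

```java
// Hedged sketch: check whether the remote driver's channel is already
// closed before blocking on a timed wait, instead of waiting out the
// configured timeout hopelessly. Names are illustrative, not Hive's.
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailFastSketch {
    private volatile boolean channelClosed;

    // Would be invoked from a connection-closed callback in the client.
    public void markChannelClosed() {
        channelClosed = true;
    }

    // Waits for the app ID, but bails out immediately if the remote
    // driver is already known to be gone.
    public String awaitAppId(Future<String> pending, long timeoutMs)
            throws InterruptedException, ExecutionException, TimeoutException {
        if (channelClosed) {
            throw new CancellationException("remote driver channel closed");
        }
        return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        FailFastSketch s = new FailFastSketch();
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        s.markChannelClosed();
        try {
            s.awaitAppId(neverCompletes, 5000); // returns immediately, no 5s wait
        } catch (CancellationException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

With such a check (or by completing the pending future exceptionally from the connection-closed callback), the client surfaces the driver failure right away rather than after {{hive.spark.client.future.timeout}} plus {{hive.spark.job.monitor.timeout}}.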



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
