hive-dev mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-7593) Instantiate SparkClient per user session
Date Fri, 01 Aug 2014 08:34:39 GMT

     [ https://issues.apache.org/jira/browse/HIVE-7593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated HIVE-7593:
------------------------------

    Description: 
SparkContext is the main class through which Hive talks to the Spark cluster. SparkClient
encapsulates a SparkContext instance. Currently all user sessions share a single SparkClient
instance in HiveServer2. While this is good enough for a POC, and even for our first two
milestones, it is not desirable in a multi-tenant environment and gives Hive users the least
flexibility. Here is what we propose:

1. Have a SparkClient instance per user session. The SparkClient instance is created when
the user executes the first query in the session, and it is destroyed when the user session ends.

2. The SparkClient is instantiated from the Spark configurations that are available to
the user, including those defined at the global level and those overridden by the user (through
the set command, for instance).

3. Ideally, when the user changes any Spark configuration during the session, the old SparkClient
instance should be destroyed and a new one created from the new configurations. This
may turn out to be a little hard, so it is a "nice-to-have". If it is not implemented, we need
to document that subsequent configuration changes will not take effect in the current session.
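
The three points above could be sketched roughly as follows. This is only an illustration of
the proposed lifecycle, not Hive code: SketchSparkClient, SketchSession, clientForQuery() and
effectiveSparkConf() are hypothetical names, the "spark." key prefix used to recognize Spark
settings is an assumption, and the stand-in client does not create a real SparkContext (so it
does not touch the SPARK-2243 limitation mentioned in this issue).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for SparkClient; the real class would wrap a SparkContext.
class SketchSparkClient implements AutoCloseable {
    final Map<String, String> conf;   // snapshot of the Spark settings used at creation
    private boolean closed = false;

    SketchSparkClient(Map<String, String> conf) {
        this.conf = new HashMap<>(conf);
    }

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; }  // real code would stop the SparkContext here
}

// Hypothetical per-session holder: one client per user session, created lazily.
class SketchSession implements AutoCloseable {
    private final Map<String, String> globalConf;
    private final Map<String, String> overrides = new HashMap<>();
    private SketchSparkClient client;        // null until the first query (point 1)

    SketchSession(Map<String, String> globalConf) {
        this.globalConf = new HashMap<>(globalConf);
    }

    // Models the user running "set key=value" in the session.
    synchronized void set(String key, String value) {
        overrides.put(key, value);
    }

    // Point 2: global settings first, then user overrides; only "spark." keys are kept
    // (the prefix is an assumption made for this sketch).
    synchronized Map<String, String> effectiveSparkConf() {
        Map<String, String> merged = new HashMap<>();
        copySparkKeys(globalConf, merged);
        copySparkKeys(overrides, merged);
        return merged;
    }

    private static void copySparkKeys(Map<String, String> from, Map<String, String> to) {
        for (Map.Entry<String, String> e : from.entrySet()) {
            if (e.getKey().startsWith("spark.")) {
                to.put(e.getKey(), e.getValue());
            }
        }
    }

    // Called once per query. Point 3: if the Spark settings changed since the client
    // was created, the old client is destroyed and a new one is built.
    synchronized SketchSparkClient clientForQuery() {
        Map<String, String> wanted = effectiveSparkConf();
        if (client != null && !client.conf.equals(wanted)) {
            client.close();
            client = null;
        }
        if (client == null) {
            client = new SketchSparkClient(wanted);
        }
        return client;
    }

    // Point 1: the client is destroyed when the user session ends.
    @Override
    public synchronized void close() {
        if (client != null) {
            client.close();
            client = null;
        }
    }
}
```

In this sketch a change to a non-Spark setting leaves the existing client in place, while a
change to a Spark setting causes the next query to get a fresh client. If point 3 is not
implemented, clientForQuery() would simply skip the comparison and keep the first client.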

Besides the functional requirements above, avoiding potential issues is also a consideration.
For instance, sharing an SC (SparkContext) among users is bad, because resources (such as jars
for UDFs) would also be shared, which is problematic. On the other hand, one SC per job seems
too expensive, as the resources need to be re-rendered even if nothing has changed.

Please note that there is a thread-safety issue on the Spark side: multiple SparkContext
instances cannot coexist in the same JVM (SPARK-2243). We need to work with the Spark community
to get this addressed.

  was:
SparkContext is the main class through which Hive talks to the Spark cluster. SparkClient
encapsulates a SparkContext instance. Currently all user sessions share a single SparkClient
instance in HiveServer2. While this is good enough for a POC, and even for our first two
milestones, it is not desirable in a multi-tenant environment and gives Hive users the least
flexibility. Here is what we propose:

1. Have a SparkClient instance per user session. The SparkClient instance is created when
the user executes the first query in the session, and it is destroyed when the user session ends.

2. The SparkClient is instantiated from the Spark configurations that are available to
the user, including those defined at the global level and those overridden by the user (through
the set command, for instance).

3. Ideally, when the user changes any Spark configuration during the session, the old SparkClient
instance should be destroyed and a new one created from the new configurations. This
may turn out to be a little hard, so it is a "nice-to-have". If it is not implemented, we need
to document that subsequent configuration changes will not take effect in the current session.

Please note that there is a thread-safety issue on the Spark side: multiple SparkContext
instances cannot coexist in the same JVM (SPARK-2243). We need to work with the Spark community
to get this addressed.


> Instantiate SparkClient per user session
> ----------------------------------------
>
>                 Key: HIVE-7593
>                 URL: https://issues.apache.org/jira/browse/HIVE-7593
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang



--
This message was sent by Atlassian JIRA
(v6.2#6252)
