hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-14240) HoS itests shouldn't depend on a Spark distribution
Date Wed, 21 Sep 2016 00:16:20 GMT

    [ https://issues.apache.org/jira/browse/HIVE-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508218#comment-15508218 ]

Sahil Takiar edited comment on HIVE-14240 at 9/21/16 12:16 AM:
---------------------------------------------------------------

I looked into this today and tried to get something working, but I don't think it's possible
without making some modifications to Spark.

* The HoS integration tests run with {{spark.master=local-cluster[2,2,1024]}} (see the sketch after this list)
** Basically, the {{TestSparkCliDriver}} JVM runs the SparkSubmit command (which spawns a new
process); the SparkSubmit process then creates 2 more processes (the Spark Executors, which
do the actual work) with 2 cores and 1024 MB of memory each
** The {{local-cluster}} option is not present in the Spark docs because it is mainly used
for integration testing within the Spark project itself; it basically provides a way of deploying
a mini cluster locally
** The advantage of {{local-cluster}} is that it does not require Spark Masters or Workers
to be running
*** Spark Workers are basically like NodeManagers, and a Spark Master is basically like HS2
* I looked through the Spark code that launches the actual Spark Executors, and it more or less
requires a {{SPARK_HOME}} directory to be present (ref: https://github.com/apache/spark/blob/branch-2.0/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java)
** {{SPARK_HOME}} is supposed to point to a directory containing a Spark distribution
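For illustration, here is a minimal sketch (not the actual Hive test harness code) of pointing a Spark job at the kind of in-process mini cluster described above; the app name and the counted data are placeholders:

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalClusterSketch {
  public static void main(String[] args) {
    // local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB] -- the same
    // shape as the local-cluster[2,2,1024] master used by the HoS itests.
    SparkConf conf = new SparkConf()
        .setMaster("local-cluster[2,2,1024]")
        .setAppName("local-cluster-sketch"); // placeholder app name
    // Unlike plain "local", local-cluster forks separate executor JVMs, which
    // is why the launcher needs a SPARK_HOME on disk (see below).
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      long count = sc.parallelize(java.util.Arrays.asList(1, 2, 3, 4)).count();
      System.out.println("count = " + count);
    }
  }
}
{code}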

Thus, we would need to modify the {{AbstractCommandBuilder}} class in Spark so that it
doesn't require {{SPARK_HOME}} to be set. However, I'm not sure how difficult that change
would be to make in Spark.
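For reference, the check in question looks roughly like the following (a paraphrased sketch, not the exact launcher code, which varies by Spark version):

{code:java}
// Paraphrased sketch of the SPARK_HOME requirement enforced by the Spark
// launcher (AbstractCommandBuilder); the real code differs in detail.
public class SparkHomeCheckSketch {
  static String getSparkHome() {
    String sparkHome = System.getenv("SPARK_HOME");
    if (sparkHome == null) {
      throw new IllegalStateException(
          "Spark home not found; set the SPARK_HOME environment variable.");
    }
    // The child-process command line is then assembled from files under
    // sparkHome, which is why a full distribution has to be present on disk.
    return sparkHome;
  }
}
{code}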

We could change the {{spark.master}} from {{local-cluster}} to {{local}}, in which case everything
would run inside a single JVM. However, I think this would reduce the coverage of the HoS tests,
since running locally isn't the same as running against a real mini cluster.



> HoS itests shouldn't depend on a Spark distribution
> ---------------------------------------------------
>
>                 Key: HIVE-14240
>                 URL: https://issues.apache.org/jira/browse/HIVE-14240
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.0.0, 2.1.0, 2.0.1
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>
> The HoS integration tests download a full Spark distribution (a tar-ball) from CloudFront.
> They use this distribution to run Spark locally. They run a few tests with Spark in embedded
> mode, and some tests against a local Spark on YARN cluster. The {{itests/pom.xml}} actually
> contains scripts to download the tar-ball from a pre-defined location.
> This is problematic because the Spark distribution shades all its dependencies, including
> Hadoop dependencies. This can cause problems when upgrading the Hadoop version for Hive (ref:
> HIVE-13930).
> Removing this dependency would also avoid downloading the tar-ball during every build, and
> simplify the build process for the itests module.
> The Hive itests should instead directly depend on Spark artifacts published in Maven
> Central. It will require some effort to get this working. The current Hive Spark Client uses
> a launch script in the Spark installation to run Spark jobs. The script basically does some
> setup work and invokes org.apache.spark.deploy.SparkSubmit. It is possible to invoke this
> class directly, which avoids the need to have a full Spark distribution available locally
> (in fact this option already exists, but isn't tested).
> There may be other issues around classpath conflicts between Hive and Spark. For example,
> Hive and Spark require different versions of Kryo. One solution to this would be to take Spark
> artifacts and shade Kryo inside them.
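The option mentioned in the description, invoking {{org.apache.spark.deploy.SparkSubmit}} directly instead of going through the {{bin/spark-submit}} script, could look roughly like the sketch below; the master, driver class, and jar path are placeholders, not the actual Hive Spark client invocation:

{code:java}
public class DirectSubmitSketch {
  public static void main(String[] args) {
    // SparkSubmit is normally launched as a separate process by bin/spark-submit;
    // its main method can also be invoked directly from an existing JVM.
    org.apache.spark.deploy.SparkSubmit.main(new String[] {
        "--master", "local[2]",              // placeholder master
        "--class", "com.example.ExampleJob", // hypothetical driver class
        "/tmp/example-job.jar"               // hypothetical application jar
    });
  }
}
{code}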



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
