spark-reviews mailing list archives

From vanzin <>
Subject [GitHub] spark pull request: [SPARK-4048] Enhance and extend hadoop-provide...
Date Tue, 28 Oct 2014 19:50:28 GMT
GitHub user vanzin opened a pull request:

    [SPARK-4048] Enhance and extend hadoop-provided profile.

    This change does a few things to make the hadoop-provided profile more useful:
    - Create new profiles for other libraries / services that might be provided by the infrastructure
    - Simplify and fix the poms so that the profiles are only activated while building assemblies.
    - Fix tests so that they're able to run when the profiles are activated
    - Add a new env variable to be used by distributions that use these profiles to provide
      the runtime classpath for Spark jobs and daemons.

You can merge this pull request into a Git repository by running:

    $ git pull SPARK-4048

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2982

commit 343ab596e2aa77b4d46f6bea65fed024a6b46168
Author: Marcelo Vanzin <>
Date:   2014-10-20T18:30:47Z

    Rework the "hadoop-provided" profile, add new ones.
    The "hadoop-provided" profile should only apply during packaging,
    since, for example, "spark-core" should still have a compile-time
    dependency on hadoop since it exposes hadoop types in its API. So
    reorganize the dependencies a bit so that the scopes are overridden
    in the packaging targets. Also, a lot of the dependencies packaged
    in the examples/ assembly are already provided by the main assembly,
    so clean those up.
    Also, add similar profiles for hive, parquet, flume and hbase (the
    last two just used by the examples/ code, although the flume one
    could also potentially be used by users' poms when packaging the
    flume backend).
    This change also includes a fix to parameterize the hbase artifact,
    since the structure of the dependencies has changed along the 0.9x
    line. It also cleans some unneeded dependencies in a few poms.

commit 39d5a55ac46315da5c2fb4b1327aac18da89d812
Author: Marcelo Vanzin <>
Date:   2014-10-21T16:59:44Z

    Re-enable maven-install-plugin for a few projects.
    Without this, running specific targets directly (e.g.
    mvn -f assembly/pom.xml) doesn't work.

commit 0beb2d3cf05ba62300373948a4aaa4b1de816f61
Author: Marcelo Vanzin <>
Date:   2014-10-23T20:19:41Z

    Propagate classpath to child processes during testing.
    When spawning child processes that use the Spark assembly jar in
    unit tests, all classes needed to run Spark are needed. If the
    assembly is built using the "*-provided" profiles, some classes
    will not be part of the assembly, although they'll be part of the
    unit test's class path since maven/sbt will make the dependencies available.
    So this change extends the unit test's class path to the child
    processes so that all classes are available.
    I also parameterized the "spark.test.home" setting so that you
    can do things like "mvn -f core/pom.xml test" and have it work
    (as long as you set it to a proper value; unfortunately maven
    makes this super painful to do automatically, because of things
    like MNG-5522).
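
The idea in the commit above, giving a spawned child the parent's class path in addition to the assembly jar, can be sketched roughly as follows. This is a minimal Python illustration and not the code from the pull request; the helper name `child_java_command`, the main class, and the jar paths are all hypothetical.

```python
import os


def child_java_command(main_class, assembly_jar, parent_classpath):
    """Build a java command line whose class path extends the assembly jar
    with the entries already visible to the parent (test) process.

    Hypothetical helper for illustration; not taken from the Spark source.
    """
    entries = [assembly_jar]
    for entry in parent_classpath:
        # Skip empty entries and duplicates, preserving the original order.
        if entry and entry not in entries:
            entries.append(entry)
    return ["java", "-cp", os.pathsep.join(entries), main_class]


cmd = child_java_command(
    "org.example.ChildMain",        # hypothetical main class
    "assembly/spark-assembly.jar",  # hypothetical assembly path
    ["lib/hadoop-client.jar", "", "lib/hive-exec.jar", "lib/hadoop-client.jar"],
)
```

With a "*-provided" assembly, the Hadoop and Hive jars in this sketch would be missing from the assembly itself, so the child only sees them because the parent's entries are passed along.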

commit 894f354c3624045d1567d8b30cf547dce78f833f
Author: Marcelo Vanzin <>
Date:   2014-10-23T22:04:11Z

    This env variable is processed by and appended
    to the generated classpath; it allows distributions that ship with
    reduced assemblies (e.g. those built with the "hadoop-provided"
    profile) to set it to add any needed libraries to the classpath
    when running Spark.
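
As a rough sketch of the behaviour described above, the following Python function appends the value of SPARK_DIST_CLASSPATH (the variable named in the next commit) to an already-generated class path. The function name `runtime_classpath` and the sample entries are assumptions made for illustration; the actual change lives in Spark's launch scripts, not in Python.

```python
import os


def runtime_classpath(generated_entries, env=None):
    """Append SPARK_DIST_CLASSPATH, if set, to an already-generated class path.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    entries = list(generated_entries)
    extra = env.get("SPARK_DIST_CLASSPATH", "")
    if extra:
        # Appended last, after the entries that were generated by default.
        entries.append(extra)
    return os.pathsep.join(entries)


with_dist = runtime_classpath(
    ["conf", "lib/spark-assembly.jar"],            # hypothetical entries
    {"SPARK_DIST_CLASSPATH": "/opt/hadoop/lib/*"}  # hypothetical value
)
without_dist = runtime_classpath(["conf", "lib/spark-assembly.jar"], {})
```

A distribution shipping a reduced assembly would set the variable once in its environment, and the class path is left unchanged when the variable is unset.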

commit d6b8aadf4cd1a321229ce2115e1d2ce3fd2dcbb4
Author: Marcelo Vanzin <>
Date:   2014-10-27T20:55:45Z

    Propagate SPARK_DIST_CLASSPATH on Yarn.
    Yarn builds the classpath based on the Hadoop configuration, which may
    miss things in case non-Hadoop classes are needed (for example, when
    Spark is built with "-Phive-provided" and the user is running code
    that uses HiveContext).
    So propagate the distribution's classpath variable so that
    the extra classpath is automatically added to all containers.
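
One possible reading of this propagation step is sketched below in Python: the launcher copies its own SPARK_DIST_CLASSPATH into the environment handed to each container. The function name `container_environment` and the sample values are hypothetical, and the real Yarn integration may apply the variable differently.

```python
def container_environment(base_env, launcher_env):
    """Return a container environment that carries over SPARK_DIST_CLASSPATH.

    Hypothetical helper; only the variable name comes from the commit message.
    """
    env = dict(base_env)  # copy so the caller's mapping is not modified
    dist_cp = launcher_env.get("SPARK_DIST_CLASSPATH")
    if dist_cp:
        env["SPARK_DIST_CLASSPATH"] = dist_cp
    return env


launcher = {"SPARK_DIST_CLASSPATH": "/opt/hive/lib/*", "HOME": "/home/user"}
env = container_environment({"JAVA_HOME": "/usr/lib/jvm/default"}, launcher)
```

Only the distribution class path is forwarded in this sketch; unrelated launcher variables such as HOME are not copied.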

commit d2613469fefcfa24d591b2160b9c0dac8733aed5
Author: Marcelo Vanzin <>
Date:   2014-10-28T18:57:44Z

    Redirect child stderr to parent's log.
    Instead of writing to System.err directly. That way the console
    is not polluted when running child processes.
    Also remove an unused env variable that caused a warning when
    running Spark jobs in child processes.
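
The stderr redirection described above might look roughly like this in Python: the child's stderr is read through a pipe and each line is handed to the parent's logger rather than printed on the console. This is an illustrative sketch under assumed names (`run_with_logged_stderr`, `ListHandler`, the "child-demo" logger), not the implementation in the pull request.

```python
import logging
import subprocess
import sys
import threading


def run_with_logged_stderr(cmd, logger):
    """Run cmd and forward each line of its stderr to logger."""
    proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)

    def pump():
        # Drain the pipe line by line so the child never blocks on a full buffer.
        for line in proc.stderr:
            logger.info("child stderr: %s", line.rstrip("\n"))

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    exit_code = proc.wait()
    reader.join()  # ensure every line has been logged before returning
    proc.stderr.close()
    return exit_code


class ListHandler(logging.Handler):
    """Collects formatted log messages in memory (demo only)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log = logging.getLogger("child-demo")
log.setLevel(logging.INFO)
log.propagate = False  # keep the demo output off the console
handler = ListHandler()
log.addHandler(handler)

code = run_with_logged_stderr(
    [sys.executable, "-c", "import sys; sys.stderr.write('warn 1\\nwarn 2\\n')"],
    log,
)
```

Reading the pipe on a separate thread keeps the parent free to wait on the child's exit code while the output is still being logged.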


