impala-dev mailing list archives

From Casey Ching
Subject Re: Getting rid of thirdparty
Date Thu, 10 Mar 2016 19:03:52 GMT
I suspect we can actually run all the test services using the Maven artifacts. Maybe we can
investigate that?

There’s not enough information about #1. How do updates work? The nice thing about the current
setup is that anyone can check out any commit and there’s a decent chance that checkout will build.
Are we going to keep that ability? How does this work for Cloudera, Apache, and others? Are
we going to upload all test binaries to the same repo?

On March 10, 2016 at 10:52:07 AM, Jim Apple wrote:
Both #1 and #3 seem reasonable to me. I think #2 should be avoided because
the Con you listed will, I think, make contributing to Impala difficult for
new contributors, and I think that's more serious than the Cons for #1 and #3.

On Thu, Mar 10, 2016 at 10:38 AM, Henry Robinson wrote:

> One of the tasks remaining before we can push Impala's code to the ASF's 
> git instance is to reduce the size of the repository. Right now even a 
> checkout of origin/cdh5-trunk is in the multi-GB range. 
> The vast majority of that is in the thirdparty/ directory, which adds up 
> over the git history to be pretty huge with all the various versions we've 
> checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we get 
> rid of thirdparty/ altogether. 
> There are two main dependency types in thirdparty/. The first is a 
> compile-time C++ dependency like open-ldap or avro-c. These are (almost) 
> all superseded by the toolchain build. A couple of exceptions are
> Squeasel and Mustache, which don't produce their own libraries but are
> source files directly included in the Impala build. I don't see a good 
> reason we couldn't move those to the toolchain as well. 
> The other kind of dependency is the test binaries that are used when we
> start Impala's test environment (i.e. start the Hive metastore, HBase,
> etc.). These are trickier to extract (they're not just JARs, but also
> bin/hadoop etc.). We also need to be able to change these dependencies pretty
> efficiently - the upstream ASF repo should use ASF-released artifacts here, 
> but downstream vendors (like Cloudera) will want to replace the ASF 
> artifacts with their own releases. 
> Note that the Java binaries in thirdparty/ are *not* the compile-time 
> dependencies for Impala's Java frontend - those are resolved via Maven. 
> It's a bad thing that there are two dependency resolution mechanisms, but we
> might not be able to solve that issue right now. 

> So what should we do with the test dependencies? I see the following 
> options: 
> 1. Put them in the native-toolchain repository. *Pros:* (almost) all 
> dependency resolution comes from one place. *Cons:* native-toolchain would 
> change very frequently as new releases happen. 
> 2. Don't provide any built-in mechanism for starting a test environment. If
> you want to test Impala - set up your own Hadoop cluster instance.
> *Pros:* removes a lot of complexity. *Cons:* pushes a lot of work onto the
> user, makes it harder to run self-contained tests.
> 3. Have a separate test-dependencies repository that does basically the 
> same thing as the toolchain. *Pros:* separates out fast-moving dependencies 
> from slow-moving ones *Cons:* more moving parts. HDFS would need to be in 
> both repositories (as libhdfs is a compile-time dependency for the 
> backend). 
> My preference is for option #1. We can do something like the following: 
> * Add a cmake target to 'build' a test environment (resolve test 
> dependencies, start mini-cluster using checked-in scripts) 
> * Add scripts to native-toolchain to download tarballs for HBase, HDFS,
> Hive and others just like compile-time dependencies. Update Impala's CMake
> scripts to use the local toolchain directory to find binaries,
> management scripts etc.
> * During each upstream release, add any new dependencies to 
> native-toolchain, and update impala.git/bin/ with the new 
> version numbers. 
> What does everyone think? 
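
The steps proposed above (pinned versions checked in with the source, tarballs downloaded into a local toolchain directory, and a download location that downstream vendors can swap) could be sketched roughly as follows. This is only an illustration, not Impala's or native-toolchain's actual tooling. The `Dependency` type, the `TEST_DEPS_BASE_URL` variable, the URL layout, the placeholder host, and the version strings are all assumptions made up for the example.

```python
import os
import shutil
import tarfile
import tempfile
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    version: str


# Placeholder pins, NOT real release numbers. Keeping this list in the source
# tree means any checked-out commit names the exact test binaries it expects.
PINNED_DEPS = [
    Dependency("hadoop", "0.0.1"),
    Dependency("hbase", "0.0.1"),
    Dependency("hive", "0.0.1"),
]

# Hypothetical default location; a downstream vendor would override it by
# setting TEST_DEPS_BASE_URL (an assumed variable name) to its own mirror.
DEFAULT_BASE_URL = "https://example.invalid/test-deps"


def base_url():
    return os.environ.get("TEST_DEPS_BASE_URL", DEFAULT_BASE_URL).rstrip("/")


def tarball_url(dep, base):
    # Assumed layout: <base>/<name>/<name>-<version>.tar.gz
    return f"{base}/{dep.name}/{dep.name}-{dep.version}.tar.gz"


def install_dir(dep, toolchain_dir):
    return os.path.join(toolchain_dir, f"{dep.name}-{dep.version}")


def missing_deps(deps, toolchain_dir):
    """Return the dependencies that are not yet unpacked locally."""
    return [d for d in deps if not os.path.isdir(install_dir(d, toolchain_dir))]


def fetch(dep, toolchain_dir, base):
    """Download and unpack one tarball.

    Assumes the archive's top-level directory is <name>-<version> and that
    the archive comes from a trusted source.
    """
    os.makedirs(toolchain_dir, exist_ok=True)
    with urllib.request.urlopen(tarball_url(dep, base)) as response, \
            tempfile.TemporaryFile() as tmp:
        shutil.copyfileobj(response, tmp)
        tmp.seek(0)
        with tarfile.open(fileobj=tmp, mode="r:gz") as archive:
            archive.extractall(toolchain_dir)


def sync(toolchain_dir):
    """Fetch every pinned dependency that is missing from toolchain_dir."""
    for dep in missing_deps(PINNED_DEPS, toolchain_dir):
        fetch(dep, toolchain_dir, base_url())
```

A build target for the test environment could then call `sync()` with the local toolchain directory before starting the mini-cluster scripts. Because the pins live next to the code, an older commit would still resolve the binaries it was tested against, provided the tarballs remain available at the configured location.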
